{"title": "Feature selection in functional data classification with recursive maxima hunting", "book": "Advances in Neural Information Processing Systems", "page_first": 4835, "page_last": 4843, "abstract": "Dimensionality reduction is one of the key issues in the design of effective machine learning methods for automatic induction. In this work, we introduce recursive maxima hunting (RMH) for variable selection in classification problems with functional data. In this context, variable selection techniques are especially attractive because they reduce the dimensionality, facilitate the interpretation and can improve the accuracy of the predictive models. The method, which is a recursive extension of maxima hunting (MH), performs variable selection by identifying the maxima of a relevance function, which measures the strength of the correlation of the predictor functional variable with the class label. At each stage, the information associated with the selected variable is removed by subtracting the conditional expectation of the process. The results of an extensive empirical evaluation are used to illustrate that, in the problems investigated, RMH has comparable or higher predictive accuracy than standard dimensionality reduction techniques, such as PCA and PLS, and state-of-the-art feature selection methods for functional data, such as maxima hunting.", "full_text": "Feature selection in functional data classi\ufb01cation with recursive maxima hunting\n\nJos\u00e9 L. Torrecilla\n\nComputer Science Department\n\nUniversidad Aut\u00f3noma de Madrid\n\n28049 Madrid, Spain\n\nAlberto Su\u00e1rez\n\nComputer Science Department\n\nUniversidad Aut\u00f3noma de Madrid\n\n28049 Madrid, Spain\n\njoseluis.torrecilla@uam.es\n\nalberto.suarez@uam.es\n\nAbstract\n\nDimensionality reduction is one of the key issues in the design of effective machine\nlearning methods for automatic induction. 
In this work, we introduce recursive\nmaxima hunting (RMH) for variable selection in classi\ufb01cation problems with functional\ndata. In this context, variable selection techniques are especially attractive\nbecause they reduce the dimensionality, facilitate the interpretation and can improve\nthe accuracy of the predictive models. The method, which is a recursive\nextension of maxima hunting (MH), performs variable selection by identifying the\nmaxima of a relevance function, which measures the strength of the correlation of\nthe predictor functional variable with the class label. At each stage, the information\nassociated with the selected variable is removed by subtracting the conditional\nexpectation of the process. The results of an extensive empirical evaluation are\nused to illustrate that, in the problems investigated, RMH has comparable or higher\npredictive accuracy than standard dimensionality reduction techniques, such as\nPCA and PLS, and state-of-the-art feature selection methods for functional data,\nsuch as maxima hunting.\n\n1 Introduction\n\nIn many important prediction problems from different areas of application (medicine, environmental\nmonitoring, etc.) the data are characterized by a function, instead of by a vector of attributes, as is\ncommonly assumed in standard machine learning problems. Some examples of these types of data\nare functional magnetic resonance imaging (fMRI) (Grosenick et al., 2008) and near-infrared spectra\n(NIR) (Xiaobo et al., 2010). Therefore, it is important to develop methods for automatic induction\nthat take into account the functional structure of the data (in\ufb01nite dimension, high redundancy, etc.)\n(Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). In this work, the problem of classi\ufb01cation\nof functional data is addressed. For simplicity, we focus on binary classi\ufb01cation problems (Ba\u00edllo\net al., 2011). 
Nonetheless, the proposed method can be readily extended to a multiclass setting. Let\nX(t), t \u2208 [0, 1] be a continuous stochastic process in a probability space (\u2126, F, P). A functional\ndatum Xn(t) is a realization of this process (a trajectory). Let {(Xn(t), Yn)}, n = 1, . . . , Ntrain, with t \u2208 [0, 1], be a\nset of trajectories labeled by the dichotomous variable Yn \u2208 {0, 1}. These trajectories come from\none of two different populations: either P0, when the label is Yn = 0, or P1, when the label is\nYn = 1. For instance, the data could be the ECGs from either healthy or sick persons (P0 and P1,\nrespectively). The classi\ufb01cation problem consists in deciding to which population a new unlabeled\nobservation X^test(t) belongs (e.g., to decide from his or her ECG whether a person is healthy or\nnot). Speci\ufb01cally, we are interested in the problem of dimensionality reduction for functional data\nclassi\ufb01cation. The goal is to achieve the optimal discrimination performance using only a \ufb01nite, small\nset of values from the trajectory as input to a standard classi\ufb01er (in our work, k-nearest neighbors).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn general, to properly handle functional data, some kind of reduction of information is necessary.\nStandard dimensionality reduction methods in functional data analysis (FDA) are based\non principal component analysis (PCA) (Ramsay and Silverman, 2005) or partial least squares\n(PLS) (Preda et al., 2007). In this work, we adopt a different approach based on variable selection\n(Guyon et al., 2006). The goal is to replace the complete function X(t) by a d-dimensional\nvector (X(t1), . . . , X(td)) for a set of \u201csuitably chosen\u201d points {t1, . . . 
, td} (for instance, instants in\na heartbeat in ECGs), where d is small.\nMost previous work on feature selection in supervised learning with functional data is quite recent\nand focuses on regression problems; for instance, on the analysis of fMRI images (Grosenick et al.,\n2008; Ryali et al., 2010) and NIR spectra (Xiaobo et al., 2010). In particular, adaptations of lasso\nand other embedded methods have been proposed to this end (see, e.g., Kneip and Sarda (2011);\nZhou et al. (2013); Aneiros and Vieu (2014)). In most cases, functional data are simply treated as\nhigh-dimensional vectors for which the standard methods apply. Speci\ufb01cally, G\u00f3mez-Verdejo et al.\n(2009) propose feature extraction from the functional trajectories before applying a multivariate\nvariable selector based on measuring the mutual information. Similarly, Fernandez-Lozano et al.\n(2015) compare different standard feature selection techniques for image texture classi\ufb01cation. The\nmethod of minimum Redundancy Maximum Relevance (mRMR) introduced by Ding and Peng (2005)\nhas been applied to functional data in Berrendero et al. (2016a). In that work distance correlation\n(Sz\u00e9kely et al., 2007) is used instead of mutual information to measure nonlinear dependencies, with\ngood results. A fully functional perspective is adopted in Ferraty et al. (2010) and Delaigle et al.\n(2012). In these articles, a wrapper approach is used to select the optimal set of instants in which the\ntrajectories should be monitored by minimizing a cross-validation estimate of the classi\ufb01cation error.\nBerrendero et al. (2015) introduce a \ufb01lter selection procedure based on computing the Mahalanobis\ndistance and Reproducing Kernel Hilbert Space techniques. 
Logistic regression models have been\napplied to the problem of binary classi\ufb01cation with functional data in Lindquist and McKeague (2009)\nand McKeague and Sen (2010), assuming Brownian and fractional Brownian trajectories, respectively.\nFinally, the selection of intervals or elementary functions instead of variables is addressed in Li and\nYu (2008); Fraiman et al. (2016) or Tian and James (2013).\nFrom the analysis of previous work one concludes that, in general, it is preferable, both in terms of\naccuracy and interpretability, to adopt a fully functional approach to the problem. In particular, if\nthe data are characterized by functions that are continuous, values of the trajectory that are close to\neach other tend to be highly redundant and convey similar information. Therefore, if the value of the\nprocess at a particular instant has high discriminant capacity, one could think of discarding nearby\nvalues. This idea is exploited in maxima hunting (MH) (Berrendero et al., 2016b).\nIn this work, we introduce recursive Maxima Hunting (RMH), a novel variable selection method for\nfeature selection in functional data classi\ufb01cation that takes advantage of the good properties of MH\nwhile addressing some of its de\ufb01ciencies. The extension of MH consists in removing the information\nconveyed by each selected local maximum before searching for the next one in a recursive manner.\nThe rest of the paper is organized as follows: Maxima hunting for feature selection in classi\ufb01cation\nproblems with functional data is introduced in Section 2. Recursive maxima hunting, which is the\nmethod proposed in this work, is described in Section 3. 
The improvements that can be obtained with\nthis novel feature selection method are analyzed in an exhaustive empirical evaluation whose results\nare presented and discussed in Section 4.\n\n2 Maxima Hunting\n\nMaxima hunting (MH) is a method for feature selection in functional classi\ufb01cation based on measuring\ndependencies between values selected from {X(t), t \u2208 [0, 1]} and the response variable (Berrendero\net al., 2016b). In particular, one selects the values {X(t1), . . . , X(td)} whose dependence with the\nclass label (i.e., the response variable) is locally maximal. Different measures of dependency can\nbe used for this purpose. In Berrendero et al. (2016b), the authors propose the distance correlation\n(Sz\u00e9kely et al., 2007). The distance covariance between the random variables X \u2208 R^p and Y \u2208 R^q,\nwhose components are assumed to have \ufb01nite \ufb01rst-order moments, is\n\n$$V^2(X, Y) = \\int_{\\mathbb{R}^{p+q}} |\\varphi_{X,Y}(u, v) - \\varphi_X(u)\\,\\varphi_Y(v)|^2 \\, w(u, v) \\, du \\, dv, \\qquad (1)$$\n\nwhere \u03d5_{X,Y}, \u03d5_X, \u03d5_Y are the characteristic functions of (X, Y), X and Y, respectively, w(u, v) =\n(c_p c_q |u|_p^{1+p} |v|_q^{1+q})^{\u22121}, c_d = \u03c0^{(1+d)/2} / \u0393((1+d)/2) is half the surface area of the unit sphere in R^{d+1}, and | \u00b7 |_d\nstands for the Euclidean norm in R^d.\nIn terms of V^2(X, Y), the square of the distance correlation is\n\n$$R^2(X, Y) = \\begin{cases} \\dfrac{V^2(X, Y)}{\\sqrt{V^2(X, X)\\, V^2(Y, Y)}}, & V^2(X, X)\\, V^2(Y, Y) > 0, \\\\ 0, & V^2(X, X)\\, V^2(Y, Y) = 0. \\end{cases} \\qquad (2)$$\n\nThe distance correlation is a measure of statistical independence; that is, R^2(X, Y) = 0 if and only\nif X and Y are independent. Besides being de\ufb01ned for random variables of different dimensions, it\nhas other valuable properties. In particular, it is rotationally invariant and scale equivariant (Sz\u00e9kely\nand Rizzo, 2012). 
A further advantage over other measures of independence, such as the mutual\ninformation, is that the distance correlation can be readily estimated using a plug-in estimator that\ndoes not involve any parameter tuning. The almost sure convergence of the estimator V_n^2 is proved in\nSz\u00e9kely et al. (2007, Thm. 2).\nTo summarize, in maxima hunting, one selects the d different local maxima of the distance correlation\nbetween X(t), the values of the random process at different instants t \u2208 [0, 1], and the response variable\n\n$$X(t_i) = \\operatorname*{argmax}_{t \\in [0,1]} R^2(X(t), Y), \\qquad i = 1, 2, \\ldots, d. \\qquad (3)$$\n\nMaxima Hunting is easy to interpret. It is also well-motivated from the point of view of FDA, because\nit takes advantage of functional properties of the data, such as continuity, which implies that similar\ninformation is conveyed by the values of the function at neighboring points. In spite of the simplicity\nof the method, it naturally accounts for the relevance and redundancy trade-off in feature selection (Yu\nand Liu, 2004): the local maxima (3) are relevant for discrimination. Points around them, which do\nnot maximize the distance correlation with the class label, are automatically excluded. Furthermore, it\nis also possible to derive a uniform convergence result, which provides additional theoretical support\nfor the method. Finally, the empirical investigation carried out in Berrendero et al. (2016b) shows\nthat MH performs well in standard benchmark classi\ufb01cation problems for functional data. In fact,\nfor some problems, one can show that the optimal (Bayes) classi\ufb01cation rule depends only on the\nmaxima of R^2(X(t), Y).\nHowever, maxima hunting also has some limitations. First, it is not always a simple task\nto estimate the local maxima, especially in functions that are very smooth or that vary abruptly.\nFurthermore, there is no guarantee that different maxima are not redundant. 
In most cases, the local\nmaxima of R^2(X(t), Y) are indeed relevant for classi\ufb01cation. However, there are important points\nfor which this quantity does not attain a maximum.\nAs an example, consider the family of classi\ufb01cation problems introduced in Berrendero et al. (2016b,\nProp. 3), in which the goal is to discriminate trajectories generated by a standard Brownian motion\nprocess, B(t), and trajectories from the process B(t) + \u03a6_{m,k}(t), where\n\n$$\\Phi_{m,k}(t) = \\int_0^t \\sqrt{2^{m-1}} \\left[ I_{\\left(\\frac{2k-2}{2^m}, \\frac{2k-1}{2^m}\\right)}(s) - I_{\\left(\\frac{2k-1}{2^m}, \\frac{2k}{2^m}\\right)}(s) \\right] ds, \\qquad m, k \\in \\mathbb{N}, \\; 1 \\le k \\le 2^{m-1}. \\qquad (4)$$\n\nAssuming a balanced class distribution (P(Y = 0) = P(Y = 1) = 1/2), the optimal classi\ufb01cation\nrule is g\u2217(x) = 1 if and only if\n\n$$\\left( X\\left(\\frac{2k-1}{2^m}\\right) - X\\left(\\frac{2k-2}{2^m}\\right) \\right) + \\left( X\\left(\\frac{2k-1}{2^m}\\right) - X\\left(\\frac{2k}{2^m}\\right) \\right) > \\frac{1}{\\sqrt{2^{m+1}}}.$$\n\nThe optimal classi\ufb01cation error is L\u2217 = 1 \u2212 normcdf(\u2016\u03a6\u2032_{m,k}(t)\u2016 / 2) = 1 \u2212 normcdf(1/2) \u2243 0.3085,\nwhere \u2016 \u00b7 \u2016 denotes the L^2[0, 1] norm, and normcdf(\u00b7) is the cumulative distribution function of\nthe standard normal. The relevance function has a single maximum at X((2k \u2212 1)/2^m). However, the\nBayes classi\ufb01cation rule involves three relevant variables, two of which are clearly not maxima of\nR^2(X(t), Y). In spite of the simplicity of these types of functional classi\ufb01cation problems, they are\nimportant to analyze, because the set of functions \u03a6_{m,k}, with m > 0 and k > 0, forms an orthonormal\nbasis of the Dirichlet space D[0, 1], the space of continuous functions whose derivatives are in\nL^2[0, 1]. Furthermore, this space is the reproducing kernel Hilbert space associated with Brownian\nmotion and plays an important role in functional classi\ufb01cation (M\u00f6rters and Peres, 2010; Berrendero\net al., 2015). In fact, any trend in the Brownian process can be approximated by a linear combination\nor by a mixture of \u03a6_{m,k}(t).\n\nFigure 1: First row: Individual and average trajectories for the classi\ufb01cation of B(t) vs. B(t) + 2\u03a6_{3,3}(t)\ninitially (left) and after the \ufb01rst (center) and second (right) corrections. Second row: Values of R^2(X(t), Y) as\na function of t. The variables required for optimal classi\ufb01cation are marked with vertical dashed lines.\n\nTo illustrate the workings of maxima hunting and its limitations we analyze in detail the classi\ufb01cation\nproblem B(t) vs. B(t) + 2\u03a6_{3,3}(t), which is of the type considered above. In this case, the optimal\nclassi\ufb01cation rule depends on the maximum X(5/8), and on X(1/2) and X(3/4), which are not\nmaxima, and would therefore not be selected by the MH algorithm. The optimal error is L\u2217 = 15.87%.\nTo illustrate the importance of selecting all the relevant variables, we perform simulations in which\nwe compare the accuracy of the linear Fisher discriminant with the maxima hunting selection, and\nwith the optimal variable selection procedures. In these experiments, independent training and test\nsamples of size 1000 are generated. The values reported are averages over 100 independent runs.\nStandard deviations are given between parentheses. The average prediction error when only the\nmaximum of the trajectories is considered is 37.63% (1.44%). When all three variables are used the\nempirical error is 15.98% (1%), which is close to the Bayes error. 
When other points in addition\nto the maximum are used (i.e., (X(t1), X(5/8), X(t2)), with t1 and t2 randomly chosen so that\n0 \u2264 t1 < 5/8 < t2 \u2264 1) the average classi\ufb01cation error is 22.32% (2.18%). In the top leftmost plot\nof Figure 1, trajectories from both classes, together with the corresponding averages (thick lines), are\nshown. The relevance function R^2(X(t), Y) is plotted below. The relevant variables, which are\nrequired for optimal classi\ufb01cation, are marked by dashed vertical lines.\n\n3 Recursive Maxima Hunting\n\nAs a variable selection process, MH avoids, at least partially, the redundancy introduced by the\ncontinuity of the functions that characterize the instances. However, this local approach cannot detect\nredundancies among different local maxima. Furthermore, there could be points in the trajectory\nthat do not correspond to maxima of the relevance function, but which are relevant when considered\njointly with the maxima. The goal of recursive maxima hunting (RMH) is to select the maxima of\nR^2(X(t), Y) in a recursive manner by removing at each step the information associated with the most\nrecently selected maximum. This avoids the in\ufb02uence of previously selected maxima, which can\nobscure ulterior dependencies. The in\ufb02uence of a selected variable X(t0) on the rest of the trajectory\ncan be eliminated by subtracting the conditional expectation E(X(t)|X(t0)) from X(t). Assuming\nthat the underlying process is Brownian,\n\n$$E(X(t) \\mid X(t_0)) = \\frac{\\min(t, t_0)}{t_0} X(t_0), \\qquad t \\in [0, 1]. \\qquad (5)$$\n\nIn the subsequent iterations, there are two intervals: [0, t0] and [t0, 1]. Conditioned on the value at\nX(t0), the process in the interval [t0, 1] is still Brownian motion. 
By contrast, for the interval [0, t0]\nthe process is a Brownian bridge, whose conditional expectation is\n\n$$E(X(t) \\mid X(t_0)) = \\frac{\\min(t, t_0) - t\\, t_0}{t_0 (1 - t_0)} X(t_0) = \\begin{cases} \\dfrac{t}{t_0} X(t_0), & t < t_0, \\\\ \\dfrac{1 - t}{1 - t_0} X(t_0), & t > t_0. \\end{cases} \\qquad (6)$$\n\nAs illustrated by the results in the experimental section, the Brownian hypothesis is a robust assumption.\nNevertheless, if additional information on the underlying stochastic processes is available, it can\nbe incorporated into the algorithm during the calculation of the conditional expectation in Equations (5)\nand (6).\nThe center and right plots in Figure 1 illustrate the behavior of RMH in the example described in\nthe previous section. The top center plot displays the trajectories and corresponding averages (thick\nlines) for both classes after applying the correction (5) with t0 = 5/8, which is the \ufb01rst maximum of\nthe distance correlation function (bottom leftmost plot in Figure 1). The variable X(5/8) is clearly\nuninformative once this correction has been applied. The distance correlation R^2(X(t), Y) for the\ncorrected trajectories is displayed in the bottom center plot. Also in this plot the relevant variables\nare marked by vertical dashed lines. It is clear that the subsequent local maxima at t = 1/2, in the\nsubinterval [0, 5/8], and at t = 3/4, in the subinterval [5/8, 1], correspond to the remaining\nrelevant variables. The last column shows the corresponding plots after the correction is applied\nanew (Equation (6) with t0 = 1/2 in [0, 5/8] and Equation (5) with t0 = 3/4 in [5/8, 1]). After this second\ncorrection, the discriminant information has been removed. 
In consequence, the distance correlation\nfunction, up to sample \ufb02uctuations, is zero.\nAn important issue in the application of this method is how to decide when to stop the recursive\nsearch. The goal is to avoid including irrelevant and/or redundant variables. To address the \ufb01rst\nproblem, we only include maxima that are suf\ufb01ciently prominent, i.e., R^2(X(tmax), Y) > s, where the relevance threshold\n0 < s < 1 can be used to gauge the relative importance of the maximum. Redundancy is avoided\nby excluding points around a selected maximum tmax for which R^2(X(tmax), X(t)) \u2265 r, for some\nredundancy threshold 0 < r < 1, which is typically close to one. As a result of these two conditions,\nonly a \ufb01nite (typically small) number of variables are selected. This data-driven stopping criterion\navoids the need to set the number of selected variables beforehand or to determine this number by a\ncostly validation procedure. The sensitivity of the results to the values of r and s is studied in\nSection 4. Nonetheless, RMH has a good and robust performance for a wide range of reasonable\nvalues of these parameters (r close to 1 and s close to 0). 
The pseudocode of the RMH algorithm is\ngiven in Algorithm 1.\n\nAlgorithm 1 Recursive Maxima Hunting\nfunction RMH(X(t), Y)\n    t\u2217 \u2190 [ ]    \u25b7 Vector of selected points, initially empty\n    RMH_rec(X(t), Y, 0, 1)    \u25b7 Recursive search of the maxima of R^2(X(t), Y)\n    return t\u2217    \u25b7 Vector of selected points\nend function\nprocedure RMH_rec(X(t), Y, tinf, tsup)\n    tmax \u2190 argmax {R^2(X(t), Y) : tinf \u2264 t \u2264 tsup}\n    if R^2(X(tmax), Y) > s then\n        t\u2217 \u2190 [t\u2217 tmax]    \u25b7 Include tmax in t\u2217, the vector of selected points\n        X(t) \u2190 X(t) \u2212 E(X(t) | X(tmax)), t \u2208 [tinf, tsup]    \u25b7 Correction of type (5) or (6) as required\n    else\n        return\n    end if\n    \u25b7 Exclude redundant points to the left of tmax\n    tmax^\u2212 \u2190 max {t : tinf \u2264 t < tmax, R^2(X(tmax), X(t)) \u2264 r}\n    if tmax^\u2212 > tinf then\n        RMH_rec(X(t), Y, tinf, tmax^\u2212)    \u25b7 Recursion on left subinterval\n    end if\n    \u25b7 Exclude redundant points to the right of tmax\n    tmax^+ \u2190 min {t : tmax < t \u2264 tsup, R^2(X(tmax), X(t)) \u2264 r}\n    if tmax^+ < tsup then\n        RMH_rec(X(t), Y, tmax^+, tsup)    \u25b7 Recursion on right subinterval\n    end if\n    return\nend procedure\n\n4 Empirical study\n\nTo assess the performance of RMH, we have carried out experiments in simulated and real-world\ndata in which it is compared with some well-established dimensionality reduction methods, such\nas PCA (Ramsay and Silverman, 2005) and partial least squares (Delaigle and Hall, 2012b), and\nwith Maxima Hunting (Berrendero et al., 2016b). In these experiments, k-nearest neighbors (kNN)\nwith the Euclidean distance is used for classi\ufb01cation. kNN has been selected because it is a simple,\nnonparametric classi\ufb01er with reasonable overall predictive accuracy. The value k in kNN is selected\nby 10-fold CV from integers in [1, \u221aNtrain], where Ntrain is the size of the training set. Since RMH\nis a \ufb01lter method for variable selection, the results are expected to be similar when other types of\nclassi\ufb01ers are used. As a reference, the results of kNN using complete trajectories (i.e., without\ndimensionality reduction) are also reported. This approach is referred to as Base. Note that, in this\ncase, the performance of kNN need not be optimal because of the presence of irrelevant attributes.\nRMH requires determining the values of two hyperparameters: the redundancy threshold r (0 < r < 1\ntypically close to 1), and the relevance threshold s (0 < s < 1 typically close to 0). Through extensive\nsimulations we have observed that RMH is quite robust for a wide range of appropriate values of\nthese parameters. In particular, the results are very similar for values of r in the interval [0.75, 0.95].\nThe predictive accuracy is somewhat more sensitive to the choice of s: If the value of s is too small,\nirrelevant variables can be selected. If s is too large, it is possible that relevant points are excluded.\nFor most of the experiments performed, the optimal values of s are between 0.025 and 0.1. In view\nof these observations, the experiments are made using r = 0.8. The value of s is selected from the set\n{0.025, 0.05, 0.1} by 10-fold CV. A more careful determination of r and s is bene\ufb01cial, especially in\nsome extreme problems (e.g., with very smooth or with rapidly-varying trajectories). In RMH, the\nnumber of selected variables, which is not determined beforehand, depends indirectly on the values\nof r and s. 
In the other methods, the number of selected variables is determined using 10-fold CV,\nwith a maximum of 30.\nA \ufb01rst batch of experiments is carried out on simulated data generated from the model\n\n$$\\begin{cases} P_0 : B(t), & t \\in [0, 1], \\\\ P_1 : B(t) + m(t), & t \\in [0, 1], \\end{cases}$$\n\nwhere B(t) is standard Brownian motion, m(t) is a deterministic trend, and P(Y = 0) = P(Y =\n1) = 1/2. Using Berrendero et al. (2015, Theorem 2), it is possible to compute the optimal\nclassi\ufb01cation rules g\u2217 and the corresponding Bayes errors L\u2217. To ensure a wide coverage, we\nconsider two problems in which the Bayes rule depends only on a few variables and two problems in\nwhich complete trajectories are needed for optimal classi\ufb01cation: (i) Peak: m(t) = 2\u03a6_{3,3}(t). The\noptimal rule depends only on X(1/2), X(5/8) and X(3/4). The Bayes error is L\u2217 \u2243 0.1587. This\nis the example analyzed in the previous section. (ii) Peak2: m(t) = 2\u03a6_{3,2}(t) + 3\u03a6_{3,3}(t) \u2212 2\u03a6_{2,2}(t).\nThe optimal rule depends only on X(1/4), X(3/8), X(1/2), X(5/8), X(3/4), and X(1). The\nBayes error is L\u2217 \u2243 0.0196. (iii) Square: m(t) = 2t^2. The Bayes error is L\u2217 \u2243 0.1241. (iv) Sin:\nm(t) = (1/2) sin(2\u03c0t). The Bayes error is L\u2217 \u2243 0.1333. In Figure 2 we have plotted some trajectories\ncorresponding to class 1 instances, together with their corresponding averages (thick lines). Class 0\ntrajectories are realizations of a standard Brownian process. In these experiments, training samples\nof different sizes (Ntrain \u2208 {50, 100, 200, 500, 1000}) and an independent test set of size 1000 are\ngenerated. The trajectories are discretized in 200 points. Half of the trajectories belong to each class\nin both the training and test sets. 
The values reported are averages over 200 independent repetitions.\nFigure 3 displays the average classi\ufb01cation error (\ufb01rst row) and the average number of selected\nvariables/components (second row) as a function of the training sample size for each model and\nclassi\ufb01cation method. Horizontal dashed lines are used to indicate the Bayes error level in the\ndifferent problems. From the results reported in Figure 3, one concludes that RMH has the best\noverall performance. It is always more accurate than the Base method. This observation justi\ufb01es\nperforming variable selection not only for the sake of dimensionality reduction, but also to improve\nthe classi\ufb01cation accuracy. RMH is also better than the original MH in all the problems investigated:\nthere is both an improvement of the prediction error and a reduction of the number of variables\nused for classi\ufb01cation. In Peak and Peak2, problems in which the relevant variables are known, RMH\ngenerally selects the correct ones. As expected, PLS performs better than PCA. However, both MH\nand RMH outperform these projection methods, except in Sin, where their accuracies are similar.\nBoth PLS and RMH are effective dimensionality reduction methods with comparable performance.\n\nFigure 2: Class 1 trajectories and averages (thick lines) for the different synthetic problems.\n\nFigure 3: Average classi\ufb01cation error (\ufb01rst row) and average number of selected variables/components (second\nrow) as a function of the size of the training set.\n\nHowever, the components selected in PLS are, in general, more dif\ufb01cult to interpret because they\ninvolve whole trajectories. 
Finally, the accuracy of RMH is very close to the Bayes level for higher\nsample sizes, even when the optimal rule requires using complete trajectories (Square and Sin).\nTo assess the performance of RMH in real-world functional classi\ufb01cation problems, we have carried\nout a second batch of experiments in four datasets, which are commonly used as benchmarks in the\nFDA literature. Instances in Growth correspond to curves of the heights of 54 girls and 38 boys\nfrom the Berkeley Growth Study. Observations are discretized in 31 non-equidistant ages between 1\nand 18 years (Ramsay and Silverman, 2005; Mosler and Mozharovskyi, 2014). The Tecator dataset\nconsists of 215 near-infrared absorbance spectra of \ufb01nely chopped meat. The spectral curves consist\nof 100 equally spaced points. The class labels are determined in terms of fat content (above or below\n20%). The curves are fairly smooth. In consequence, we have followed the general recommendation\nand used the second derivative for classi\ufb01cation (Ferraty and Vieu, 2006; Galeano et al., 2014).\nThe Phoneme data consists of 4509 log-periodograms observed at 256 equidistant points. Here, we\nconsider the binary problem of distinguishing between the phonemes \u201caa\u201d (695) and \u201cao\u201d (1022)\n(Galeano et al., 2014). Following Delaigle and Hall (2012a), the curves are smoothed with a local\nlinear method and truncated to the \ufb01rst 50 variables. The Med\ufb02ies are records of daily egg-laying\npatterns of a thousand \ufb02ies. The goal is to discriminate between short- and long-lived \ufb02ies. Following\nMosler and Mozharovskyi (2014), curves equal to zero are excluded. There are 512 30-day curves\n(starting from day 5) of \ufb02ies that live at least 34 days; 266 of these are long-lived (reach day\n44). The classes in Growth and Tecator are well separated. In consequence, they are relatively easy\nproblems. 
By contrast, Phoneme and Med\ufb02ies are notoriously dif\ufb01cult classi\ufb01cation tasks. Some\ntrajectories of each problem and each class, together with the corresponding averages (thick lines),\nare plotted in Figure 4. To estimate the classi\ufb01cation error, the datasets are partitioned at random\ninto a training set (with 2/3 of the observations) and a test set (1/3). This procedure is repeated\n200 times. The boxplots of the results for each dataset and method are shown in Figure 5. Errors\nare shown in the \ufb01rst row and the number of selected variables/components in the second one.\n\nFigure 4: Trajectories for each of the classes and their corresponding averages (thick lines).\n\nFigure 5: Classi\ufb01cation error (\ufb01rst row) and number of variables/components selected (second row) by RMH.\n\nFrom these results we observe that, in general, dimensionality reduction is effective: the accuracy of the\nfour considered methods is similar to or better than that of the Base method, in which complete trajectories\nare used for classi\ufb01cation. In particular, Base does not perform well when the trajectories are not\nsmooth (Med\ufb02ies). The best overall performance corresponds to RMH. In the easy problems (Growth\nand Tecator), all methods behave similarly and give good results. In Growth, RMH is slightly more\naccurate. However, it tends to select more variables than the other methods. 
In the more difficult\nproblems (Phoneme and Medflies), RMH yields very accurate predictions while selecting only two\nvariables. In these problems it exhibits the best performance, except in Phoneme, where Base is\nmore accurate. The variables selected by RMH and MH are directly interpretable, which is an\nadvantage over projection-based methods (PCA, PLS). Finally, let us point out that the accuracy of\nRMH is comparable to, and often better than, that of state-of-the-art functional classification methods. See, for\ninstance, Berrendero et al. (2016a); Delaigle et al. (2012); Delaigle and Hall (2012a); Mosler and\nMozharovskyi (2014); Galeano et al. (2014). In most of these works no dimensionality reduction is\napplied. Nevertheless, these comparisons must be made carefully because the evaluation protocol and\nthe classifiers used vary across studies. In any case, RMH is a filter method, which means\nthat it could be more effective if used in combination with other types of classifiers or adapted and\nused as a wrapper or even as an embedded variable selection method.\n\nAcknowledgments\n\nThe authors thank Dr. Jos\u00e9 R. Berrendero for his insightful suggestions. We also acknowledge\nfinancial support from the Spanish Ministry of Economy and Competitiveness, project TIN2013-42351-P,\nand from the Regional Government of Madrid, CASI-CAM-CM project (S2013/ICE-2845).\n\nReferences\nAneiros, G. 
and P. Vieu (2014). Variable selection in infinite-dimensional problems. Statistics & Probability Letters 94, 12\u201320.\nBa\u00edllo, A., A. Cuevas, and R. Fraiman (2011). Classification methods for functional data, pp. 259\u2013297. Oxford: Oxford University Press.\nBerrendero, J. R., A. Cuevas, and J. L. Torrecilla (2015). On near perfect classification and functional Fisher rules via reproducing kernels. arXiv:1507.04398, 1\u201327.\nBerrendero, J. R., A. Cuevas, and J. L. Torrecilla (2016a). The mRMR variable selection method: a comparative study for functional data. Journal of Statistical Computation and Simulation 86(5), 891\u2013907.\nBerrendero, J. R., A. Cuevas, and J. L. Torrecilla (2016b). Variable selection in functional data classification: a maxima hunting proposal. Statistica Sinica 26(2), 619\u2013638.\nDelaigle, A. and P. Hall (2012a). Achieving near perfect classification for functional data. Journal of the Royal Statistical Society B 74(2), 267\u2013286.\nDelaigle, A. and P. Hall (2012b). Methodology and theory for partial least squares applied to functional data. The Annals of Statistics 40(1), 322\u2013352.\nDelaigle, A., P. Hall, and N. Bathia (2012). Componentwise classification and clustering of functional data. Biometrika 99(2), 299\u2013313.\nDing, C. and H. Peng (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology 3(2), 185\u2013205.\nFernandez-Lozano, C., J. A. Seoane, M. Gestal, T. R. Gaunt, J. Dorado, and C. Campbell (2015). Texture classification using feature selection and kernel-based techniques. Soft Computing 19(9), 2469\u20132480.\nFerraty, F., P. Hall, and P. Vieu (2010). Most-predictive design points for functional data predictors. Biometrika 97(4), 807\u2013824.\nFerraty, F. and P. Vieu (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer.\nFraiman, R., Y. Gim\u00e9nez, and M. Svarc (2016). Feature selection for functional data. Journal of Multivariate Analysis 146, 191\u2013208.\nGaleano, P., E. Joseph, and R. E. Lillo (2014). The Mahalanobis distance for functional data with applications to classification. Technometrics 57(2), 281\u2013291.\nG\u00f3mez-Verdejo, V., M. Verleysen, and J. Fleury (2009). Information-theoretic feature selection for functional data classification. Neurocomputing 72(16), 3580\u20133589.\nGrosenick, L., S. Greer, and B. Knutson (2008). Interpretable classifiers for FMRI improve prediction of purchases. Neural Systems and Rehabilitation Engineering, IEEE Transactions on 16(6), 539\u2013548.\nGuyon, I., S. Gunn, M. Nikravesh, and L. A. Zadeh (2006). Feature Extraction: Foundations and Applications. Springer.\nKneip, A. and P. Sarda (2011). Factor models and variable selection in high-dimensional regression analysis. The Annals of Statistics 39(5), 2410\u20132447.\nLi, B. and Q. Yu (2008). Classification of functional data: A segmentation approach. Computational Statistics & Data Analysis 52(10), 4790\u20134800.\nLindquist, M. A. and I. W. McKeague (2009). Logistic regression with Brownian-like predictors. Journal of the American Statistical Association 104(488), 1575\u20131585.\nMcKeague, I. W. and B. Sen (2010). Fractals with point impact in functional linear regression. Annals of Statistics 38(4), 2559.\nM\u00f6rters, P. and Y. Peres (2010). Brownian Motion. Cambridge University Press.\nMosler, K. and P. Mozharovskyi (2014). Fast DD-classification of functional data. Statistical Papers 55, 49\u201359.\nPreda, C., G. Saporta, and C. L\u00e9v\u00e9der (2007). PLS classification of functional data. Computational Statistics 22(2), 223\u2013235.\nRamsay, J. O. and B. W. Silverman (2005). Functional Data Analysis. Springer.\nRyali, S., K. Supekar, D. A. Abrams, and V. Menon (2010). Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage 51(2), 752\u2013764.\nSz\u00e9kely, G. J. and M. L. Rizzo (2012). On the uniqueness of distance covariance. Statistics & Probability Letters 82(12), 2278\u20132282.\nSz\u00e9kely, G. J., M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics 35(6), 2769\u20132794.\nTian, T. S. and G. M. James (2013). Interpretable dimension reduction for classifying functional data. Computational Statistics & Data Analysis 57(1), 282\u2013296.\nXiaobo, Z., Z. Jiewen, M. J. Povey, M. Holmes, and M. Hanpin (2010). Variables selection methods in near-infrared spectroscopy. Analytica Chimica Acta 667(1), 14\u201332.\nYu, L. and H. Liu (2004). Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research 5, 1205\u20131224.\nZhou, J., N.-Y. Wang, and N. Wang (2013). Functional linear model with zero-value coefficient function at sub-regions. Statistica Sinica 23(1), 25\u201350.\n", "award": [], "sourceid": 2452, "authors": [{"given_name": "Jos\u00e9", "family_name": "Torrecilla", "institution": "Universidad Aut\u00f3noma de Madrid"}, {"given_name": "Alberto", "family_name": "Su\u00e1rez", "institution": "Universidad Aut\u00f3noma de Madrid"}]}