{"title": "On Binary Classification in Extreme Regions", "book": "Advances in Neural Information Processing Systems", "page_first": 3092, "page_last": 3100, "abstract": "In pattern recognition, a random label Y is to be predicted based upon observing a random vector X valued in $\\mathbb{R}^d$ with d>1 by means of a classification rule with minimum probability of error. In a wide variety of applications, ranging from finance/insurance to environmental sciences through teletraffic data analysis for instance, extreme (i.e. very large) observations X are of crucial importance, while contributing in a negligible manner to the (empirical) error however, simply because of their rarity. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes. Precisely, under non-parametric heavy-tail assumptions for the class distributions, we prove that a natural and asymptotic notion of risk, accounting for predictive performance in extreme regions of the input space, can be defined and show that minimizers of an empirical version of a non-asymptotic approximant of this dedicated risk, based on a fraction of the largest observations, lead to classification rules with good generalization capacity, by means of maximal deviation inequalities in low probability regions. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.", "full_text": "On Binary Classi\ufb01cation in Extreme Regions\n\nHamid Jalalzai, Stephan Cl\u00b4emenc\u00b8on and Anne Sabourin\n\nLTCI Telecom ParisTech, Universit\u00b4e Paris-Saclay\n\n75013, Paris, France\n\nfirst.last@telecom-paristech.fr\n\nAbstract\n\nIn pattern recognition, a random label Y is to be predicted based upon observ-\ning a random vector X valued in Rd with d \u2265 1 by means of a classi\ufb01cation\nrule with minimum probability of error. 
In a wide variety of applications, ranging from finance/insurance to environmental sciences through teletraffic data analysis for instance, extreme (i.e. very large) observations X are of crucial importance, while contributing in a negligible manner to the (empirical) error, simply because of their rarity. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes. Precisely, under non-parametric heavy-tail assumptions for the class distributions, we prove that a natural and asymptotic notion of risk, accounting for predictive performance in extreme regions of the input space, can be defined, and show that minimizers of an empirical version of a non-asymptotic approximant of this dedicated risk, based on a fraction of the largest observations, lead to classification rules with good generalization capacity, by means of maximal deviation inequalities in low probability regions. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

1 Introduction

Because it covers a wide range of practical applications and its probabilistic theory can be straightforwardly extended to some extent to various other prediction problems, binary classification can be considered as the flagship problem in statistical learning. In the standard setup, (X, Y) is a random pair defined on a certain probability space with (unknown) joint probability distribution P, where the (output) r.v. Y is a binary label, taking its values in {−1, +1} say, and X models some information, valued in Rd and hopefully useful to predict Y. In this context, the goal pursued is generally to build, from a training sample Dn = {(X1, Y1), . . .
, (Xn, Yn)} composed of n ≥ 1 i.i.d. realizations of (X, Y), a classifier g : Rd → {−1, +1} minimizing the probability of error LP(g) = P{Y ≠ g(X)}. The Empirical Risk Minimization paradigm (ERM in abbreviated form, see e.g. [5]) suggests considering solutions gn of the minimization problem min_{g∈G} L̂n(g), where L̂n(g) is a statistical estimate of the risk LP(g). In general the empirical version L̂n(g) = (1/n) Σ_{i=1}^n 1{Yi ≠ g(Xi)} is considered, denoting by 1{E} the indicator function of any event E. This amounts to replacing P in LP with the empirical distribution of the (Xi, Yi)'s. The class G of predictive rules is supposed to be rich enough to contain a reasonable approximant of the minimizer of LP, i.e. the Bayes classifier g*(x) = 2·1{η(x) ≥ 1/2} − 1, where η(X) = P{Y = 1 | X} denotes the posterior probability.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Because extreme observations X, i.e. observations whose norm ‖X‖ exceeds some large threshold t > 0, are rare and thus underrepresented in the training dataset Dn, classification errors in these regions of the input space may have a negligible impact on the global prediction error of ĝn. Notice incidentally that the threshold t may depend on n, since 'large' should naturally be understood as large w.r.t. the vast majority of data previously observed. Using the total probability formula, one may indeed write

LP(g) = P{‖X‖ > t} P{Y ≠ g(X) | ‖X‖ > t} + P{‖X‖ ≤ t} P{Y ≠ g(X) | ‖X‖ ≤ t}.   (1)

Hence, due to the extremely small order of magnitude of P{‖X‖ > t} and of its empirical counterpart, there is no guarantee that the standard ERM strategy produces an optimal classifier on the extreme region {x : ‖x‖ > t}. In other words, the quantity P{Y ≠ ĝn(X) | ‖X‖ > t} may not be nearly optimal, whereas in certain practical applications (e.g. finance, insurance, environmental sciences, aeronautics safety), accurate prediction in extreme regions is crucial.

The purpose of the subsequent analysis is to investigate the problem of building a classifier such that the first term of the decomposition (1) is asymptotically minimum as t → +∞. We thus consider the conditional probability of error, next referred to as the classification risk above level t, given by

Lt(g) := LPt(g) = P{Y ≠ g(X) | ‖X‖ > t},   (2)

denoting by Pt the conditional distribution of (X, Y) given ‖X‖ > t. In this paper, we address the issue of learning a classifier gn whose risk Lt(gn) is asymptotically minimum as t → ∞ with high probability. In order to develop a framework showing that a variant of the ERM principle tailored to this statistical learning problem leads to predictive rules with good generalization capacities, (non-parametric) distributional assumptions related to the tail behavior of the class distributions F+ and F−, the conditional distributions of the input r.v. X given Y = +1 and Y = −1, are required. Precisely, we assume that they are both multivariate regularly varying, which corresponds to a large non-parametric class of (heavy-tailed) distributions, widely used in applications where the impact of extreme observations should be enhanced, or at least not neglected.
Hence, under appropriate non-parametric assumptions on F+ and F−, as well as on the tail behavior of η(x), we prove that min_g Lt(g) converges to a quantity denoted by L*∞ and referred to as the asymptotic risk in the extremes, as t → ∞. It is also shown that this limit can be interpreted as the minimum classification error related to a (non-observable) random pair (X∞, Y∞), whose distribution P∞ corresponds to the limit of the conditional distribution of (X, Y) given ‖X‖ > t, for an appropriate normalization of X, as t → ∞. With respect to the goal set above, we next investigate the performance of minimizers ĝn,τ of an empirical version of the risk LPtτ, where tτ is the (1 − τ) quantile of the r.v. ‖X‖ and τ ≪ 1. The computation of ĝn,τ involves the ⌊nτ⌋ input observations with largest norm, and the minimization is performed over a collection of classifiers of finite VC dimension. Based on a variant of the VC inequality tailored to low probability regions, rate bounds for the deviation Lt(ĝn,τ) − L*∞ are established, of order O_P(1/√(nτ)) namely. These theoretical results are also illustrated by preliminary experiments based on synthetic data.

The rest of the paper is organized as follows. Multivariate extreme value theory (MEVT) notions involved in the framework we develop are described in section 2, together with the probabilistic setup we consider for classification in the extremes. A notion of risk tailored to this statistical learning task is also introduced therein. Section 3 investigates how to extend the ERM principle in this situation. In particular, probability bounds proving the generalization ability of minimizers of a non-asymptotic approximant of the risk previously introduced are established.
Illustrative numerical results are displayed in section 4, while several concluding remarks are collected in section 5. Some technical details and proofs are deferred to the Supplementary Material.

2 Probabilistic Framework - Preliminary Results

We start off with recalling concepts pertaining to MEVT and next develop a general framework in order to formulate the problem of binary classification in the extremes in a rigorous manner. For completeness, additional details about regular variation and vague convergence are given in the supplementary material (Appendix A).

2.1 Regularly Varying Random Vector

By definition, heavy-tail phenomena are those which are ruled by very large values, occurring with a far from negligible probability and with significant impact on the system under study. When the phenomenon of interest is described by the distribution of a univariate random variable, the theory of regularly varying functions provides the appropriate mathematical framework for the study of heavy-tailed distributions. One may refer to [11] for an excellent account of the theory of regularly varying functions and its application to the study of heavy-tailed distributions. For examples of works where such assumptions are considered in the context of statistical learning, see e.g. [6, 3, 12, 10, 1] or [8]. Let α > 0; a random variable X is said to be regularly varying with tail index α if

P{X > tx | X > t} → x^(−α) as t → ∞, for all x > 1.

This is the case if and only if there exists a function b : R+ → R*+ with b(t) → ∞ such that for all x > 0, the quantity tP{X/b(t) > x} converges to some limit h(x) as t → ∞. Then b may be chosen as b(t) = t^(1/α) and h(x) = c x^(−α) for some c > 0. Based on this characterization, the heavy-tail model can be extended to the multivariate setup.
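As a quick numerical illustration (not part of the original text), the defining property above can be checked on simulated standard Pareto data, for which P{X > x} = x^(−α) exactly:

```python
import numpy as np

# Sanity check (illustrative, not from the paper): for a Pareto(alpha) sample,
# the empirical conditional exceedance probability P{X > t*x | X > t}
# should be close to x**(-alpha) for a large threshold t.
rng = np.random.default_rng(0)
alpha = 2.0
# numpy's pareto() draws Lomax variables; adding 1 yields a standard Pareto,
# i.e. P{X > x} = x**(-alpha) for x >= 1.
X = rng.pareto(alpha, size=10**6) + 1.0

t = np.quantile(X, 0.99)  # a "large" threshold
x = 2.0
cond_prob = np.mean(X > t * x) / np.mean(X > t)  # empirical P{X > t*x | X > t}
print(cond_prob, x ** -alpha)  # both close to 0.25
```

The same check degrades gracefully for merely regularly varying (rather than exactly Pareto) tails, which is the point of the asymptotic definition.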
Consider a d-dimensional random vector X = (X(1), . . . , X(d)) taking its values in Rd+. Assume that all the X(j) are regularly varying with index α > 0. Then the random vector X is said to be regularly varying with tail index α if there exists a non-null positive Radon measure μ on the punctured space E = [0, ∞]d \ {0} and a function b(t) → ∞ such that for all Borel sets A ⊂ E such that 0 ∉ ∂A and μ(∂A) = 0,

tP{X/b(t) ∈ A} → μ(A) as t → ∞.

In such a case, the so-called exponent measure μ fulfills the homogeneity property μ(tC) = t^(−α) μ(C) for all t > 0 and any Borel set C ⊂ E. This suggests a decomposition of μ into a radial component R and an angular component Φ. For all x = (x1, . . . , xd) ∈ Rd+, set

R(x) = ‖x‖,  Θ(x) = (x1/R(x), . . . , xd/R(x)) ∈ S,

where S is the positive orthant of the unit sphere in Rd for the chosen norm ‖·‖. The choice of the norm is unimportant as all norms are equivalent in Rd. Define an angular measure Φ on S as

Φ(B) = μ{rθ : θ ∈ B, r ≥ 1}, B ⊂ S measurable.

The angular measure Φ is finite, and the conditional distribution of (R(X)/t, Θ(X)) given that R(X) > t converges as t → ∞ to a limit which admits the following product decomposition: for r ≥ 1 and B ⊂ S such that Φ(∂B) = 0,

lim_{t→∞} P{R(X)/t > r, Θ(X) ∈ B | R(X) > t} = c μ{x : R(x) > r, Θ(x) ∈ B} = c r^(−α) Φ(B),

where c = μ{x : R(x) > 1}^(−1) = Φ(S)^(−1) is a normalizing constant.
Thus cΦ may be viewed as the limiting distribution of Θ(X) given that R(X) is large.

Remark 1 It is assumed above that all marginal distributions are tail equivalent to the Pareto distribution with index α. In practice, the tails of the marginals may differ and it may be convenient to work with marginally standardized variables, that is, to separate the margins Fj(xj) = P{X(j) ≤ xj} from the dependence structure in the description of the joint distribution of X. Consider the standardized variables V(j) = 1/(1 − Fj(X(j))) ∈ [1, ∞] and V = (V(1), . . . , V(d)). Replacing X by V permits taking α = 1 and b(t) = t.

2.2 Classification in the Extremes - Assumptions, Criterion and Optimal Elements

We place ourselves in the binary classification framework recalled in the introduction. For simplicity, we suppose that X takes its values in the positive orthant Rd+. The general aim is to build, from training data in the extreme region (i.e. data points (Xi, Yi) such that ‖Xi‖ > tn for a large threshold value tn > 0), a classifier gn(x) with risk Ltn(gn) defined in (2) being asymptotically minimum as tn → ∞. For this purpose, we introduce general assumptions guaranteeing that the minimum risk Lt(g*) above level t has a limit as t → ∞.
Throughout the article, we assume that the class distributions F+ and F− are heavy-tailed with the same index α = 1.

Assumption 1 For all σ ∈ {−, +}, the conditional distribution of X given Y = σ1 is regularly varying with index 1 and angular measure Φσ(dθ) (respectively, exponent measure μσ(dx)): for A ⊂ [0, ∞]d \ {0} a measurable set such that 0 ∉ ∂A and μσ(∂A) = 0,

tP{t^(−1)X ∈ A | Y = σ1} → μσ(A) as t → ∞,  σ ∈ {−, +},

and for B ⊂ S a measurable set,

Φσ(B) = μσ{x ∈ Rd+ : R(x) > 1, Θ(x) ∈ B},  σ ∈ {−, +}.

Under the hypothesis above, X's marginal distribution, given by F = pF+ + (1 − p)F−, where p = P{Y = +1} > 0, is heavy-tailed as well with index 1. Indeed, we have:

tP{t^(−1)X ∈ A} → μ(A) := p μ+(A) + (1 − p) μ−(A) as t → ∞,

and similarly Φ(B) := p Φ+(B) + (1 − p) Φ−(B).

Observe also that the limiting class balance can be expressed using the latter asymptotic measures. Indeed, let Ω = {x ∈ Rd+ : ‖x‖ ≤ 1} denote the positive orthant of the unit ball and let Ωc denote its complementary set in Rd+. We have:

pt = P{Y = +1 | ‖X‖ > t} = (tP{‖X‖ > t | Y = 1} p) / (tP{‖X‖ > t}) → p μ+(Ωc)/μ(Ωc) = p Φ+(S)/Φ(S) =: p∞ as t → ∞.   (3)

Remark 2 (ON ASSUMPTION 1) We point out that only the situation where the supposedly heavy-tailed class distributions F+ and F− have the same tail index is of interest.
Suppose for instance that the tail index α+ of F+ is strictly larger than that of F−, denoted α−, that is, F− has heavier tail than F+. In such a case F is still regularly varying, with index min{α+, α−}, and pt → 0. In this case, one may straightforwardly see that the classifier always predicting −1 on {x ∈ Rd+ : ‖x‖ > t} is optimal as t increases to infinity.

Remark 3 (ON ASSUMPTION 1 (BIS)) As noticed in Remark 1, assuming that α = 1 is not restrictive when the marginal distributions are known. In practice however, they must be estimated. Due to space limitations, in the present analysis we shall neglect the error arising from their estimation. Relaxing this assumption, as done e.g. in [7], will be the subject of future work.

Asymptotic criterion for classification in the extremes. The goal pursued is to construct a classifier gn, based on the training examples Dn, minimizing the asymptotic risk in the extremes given by

L∞(g) = lim sup_{t→∞} Lt(g).   (4)

We also set L*∞ = inf_{g measurable} L∞(g). It is immediate that any classifier which coincides with the Bayes classifier g* on the region tΩc = {x ∈ Rd+ : ‖x‖ > t} is optimal w.r.t. the distribution Pt. In particular g* minimizes Lt and the associated risk is

L*t := Lt(g*) = E[min{η(X), 1 − η(X)} | ‖X‖ > t],  t > 0.   (5)

Thus, for all classifiers g, Lt(g) ≥ Lt(g*), and taking the limit superior shows that g* minimizes L∞, that is L*∞ = L∞(g*).

Optimality. The objective formulated above can be connected with a standard binary classification problem, related to a random pair (X∞, Y∞) taking its values in the limit space Ωc × {−1, +1}, see Theorem 1 below.
Let P{Y∞ = +1} = p∞ as in (3) and define the distribution of X∞ given that Y∞ = σ1, σ ∈ {−, +}, as μσ(Ωc)^(−1) μσ(·). Then for A ⊂ Ωc, using (3),

P{X∞ ∈ A, Y∞ = +1} = p∞ μ+(A)/μ+(Ωc) = p μ+(A)/μ(Ωc) = (p lim_t tP{X ∈ tA | Y = +1}) / (lim_t tP{X ∈ tΩc}) = lim_{t→∞} P{X ∈ tA, Y = +1 | ‖X‖ > t}.

We denote by P∞ the joint distribution of (X∞, Y∞) thus defined. As shall be seen below, under appropriate and natural assumptions, classifiers with minimum asymptotic risk in the extremes are in 1-to-1 correspondence with solutions of the binary classification problem related to (X∞, Y∞). Let ρ be a common dominating measure for Φ−, Φ+ on S (ρ does not need to be the Lebesgue measure; take e.g. ρ = Φ+ + Φ−). Then denote by φ+, φ− respectively the densities of Φ+, Φ− w.r.t. ρ. By homogeneity of μ+, μ−, the conditional distribution of Y∞ given X∞ = x is

η∞(x) := P{Y∞ = 1 | X∞ = x} = (p∞ φ+(Θ(x))/Φ+(S)) / (p∞ φ+(Θ(x))/Φ+(S) + (1 − p∞) φ−(Θ(x))/Φ−(S)) = p φ+(Θ(x)) / (p φ+(Θ(x)) + (1 − p) φ−(Θ(x))).

Notice that η∞ is independent of the chosen reference measure ρ and that η∞ is constant along rays, that is η∞(tx) = η∞(x) for (t, x) such that min(‖tx‖, ‖x‖) ≥ 1.
The optimal classifier for the random pair (X∞, Y∞) with respect to the classical risk LP∞ is clearly

g*∞(x) = 2·1{η∞(x) ≥ 1/2} − 1.

Again, g*∞ is constant along rays on Ωc and is thus a function of Θ(x) only. We abusively denote η∞(x) = η∞(Θ(x)). The minimum classification error is

L*P∞ = LP∞(g*∞) = E[min{η∞(Θ∞), 1 − η∞(Θ∞)}],   (6)

where Θ∞ = Θ(X∞). More generally, observe that any class GS of classifiers g : θ ∈ S ↦ g(θ) ∈ {−1, +1} defines a class of classifiers on Rd+, x ∈ Rd+ ↦ g(Θ(x)), that shall still be denoted by GS for simplicity. The next result claims that, under the regularity hypothesis stated below, the classifier g*∞ is optimal for the asymptotic risk in the extremes, that is L∞(g*∞) = inf_g L∞(g).
We shall also prove that L∞(g*∞) = L*P∞.

Assumption 2 (UNIFORM CONVERGENCE ON THE SPHERE OF η(tx)) The limiting regression function η∞ is continuous on S and

sup_{θ∈S} |η(tθ) − η∞(θ)| → 0 as t → ∞.

Remark 4 (ON ASSUMPTION 2) By invariance of η∞ along rays, Assumption 2 is equivalent to

sup_{x∈Rd+ : ‖x‖≥t} |η(x) − η∞(x)| → 0 as t → ∞.

Assumption 2 is satisfied whenever the probability densities f+, f− of F+, F− are continuous, regularly varying with limit functions q+, q−, and when the convergence is uniform, that is, if

lim_{t→∞} sup_{x∈S} |t^(d+1) fσ(tx) − qσ(x)| = 0,  σ ∈ {+, −}.   (7)

In such a case q+, q− are respectively the densities of μ+, μ− with respect to the Lebesgue measure and are continuous, which implies the continuity of φ+, φ−. The latter uniform convergence assumption is introduced in [4] and is used e.g. in [2] in the context of minimum level sets estimation.

Theorem 1 (OPTIMAL CLASSIFIERS IN THE EXTREMES) Under Assumptions 1 and 2,

L*t → L*P∞ as t → ∞.   (8)

Hence, we have: L*∞ = L*P∞. In addition, the classifier g*∞ minimizes the asymptotic risk in the extremes:

inf_{g measurable} L∞(g) = L∞(g*∞) = E{min(η∞(Θ∞), 1 − η∞(Θ∞))}.

Refer to the Supplementary Material for the technical proof. Theorem 1 gives us the form of the optimal classifier in the extremes, g*∞(x) = g*∞(Θ(x)), which depends only on the angular component Θ(x), not the norm R(x). This naturally leads to applying the ERM principle to a collection of classifiers of the form g(x) = g(Θ(x)) on the domain {x ∈ Rd+ : ‖x‖ > t} for t > 0 large enough. The next section provides statistical guarantees for this approach.

3 Empirical Risk Minimization in the Extremes

Consider a class GS of classifiers g : θ ∈ S ↦ g(θ) ∈ {−1, +1} on the sphere S. It also defines a collection of classifiers on Rd+, namely {g(Θ(x)) : g ∈ GS}, which we denote by GS for simplicity. Sorting the training observations by decreasing order of magnitude, we introduce the order statistics ‖X(1)‖ > . . . > ‖X(n)‖ and we denote by Y(i) the corresponding sorted labels. Fix a small fraction τ > 0 of extreme observations, and let tτ be the quantile at level (1 − τ) of the r.v. ‖X‖: P{‖X‖ > tτ} = τ. Set k = ⌊nτ⌋ and consider the empirical risk

L̂k(g) = (1/k) Σ_{i=1}^k 1{Y(i) ≠ g(Θ(X(i)))} = LP̂k(g),   (9)

where P̂k denotes the empirical distribution of the truncated training sample {(Xi, Yi) : ‖Xi‖ ≥ ‖X(k)‖, i ∈ {1, . . . , n}}, the statistical version of the conditional distribution Ptτ. We now investigate the performance, in terms of asymptotic risk in the extremes L∞, of the solutions of the minimization problem

min_{g∈GS} L̂k(g).   (10)

The practical issue of designing efficient algorithms for solving (10) is beyond the scope of this paper. Focus is here on the study of the learning principle that consists in assigning to any very large input value x the likeliest label based only on the direction Θ(x) it defines (the construction is summarized in Algorithm 1 below).
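As an aside, a minimal self-contained sketch of this learning principle (not from the paper) could look as follows: marginal rank-standardization, truncation to the k largest points, and a classifier that only sees angles. The 1-nearest-neighbour rule on the sphere stands in here for the class GS and is a purely illustrative choice.

```python
import numpy as np

def rank_transform(X_train, X):
    """Marginal standardization V(j) = 1/(1 - F_j(x_j)), with F_j replaced by
    the empirical c.d.f. of the training sample (cf. Remark 1)."""
    n = X_train.shape[0]
    V = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        # empirical c.d.f., kept strictly below 1 so that V stays finite
        F = np.searchsorted(np.sort(X_train[:, j]), X[:, j], side="right") / (n + 1)
        V[:, j] = 1.0 / (1.0 - F)
    return V

def fit_extreme_1nn(X_train, y_train, k):
    """ERM-in-the-extremes sketch: keep the k largest standardized points
    (L1 norm) and store only their angles Theta(V) = V / ||V||_1."""
    V = rank_transform(X_train, X_train)
    norms = V.sum(axis=1)              # L1 norm (coordinates are >= 1)
    idx = np.argsort(norms)[-k:]       # indices of the k most extreme points
    angles = V[idx] / norms[idx, None]
    return angles, y_train[idx]

def predict_extreme_1nn(model, X_train, X):
    """Label a (large) point by the label of the closest stored angle."""
    angles, labels = model
    V = rank_transform(X_train, X)
    theta = V / V.sum(axis=1, keepdims=True)
    d = np.abs(theta[:, None, :] - angles[None, :, :]).sum(axis=2)
    return labels[d.argmin(axis=1)]
```

Any angular classifier of finite VC dimension could replace the 1-NN rule; the point is that both training and prediction depend on x only through Θ(T̂(x)).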
The following result provides an upper bound for the excess of classification error in the domain tτΩc of solutions of (10). Its proof, which relies on a maximal deviation inequality tailored to low probability regions, is given in the Supplementary Material.

Theorem 2 Suppose that the class GS is of finite VC dimension VGS < +∞. Let ĝk be any solution of (10). Recall k = ⌊nτ⌋. Then, for δ ∈ (0, 1), ∀n ≥ 1, we have with probability larger than 1 − δ:

Ltτ(ĝk) − L*tτ ≤ (1/√k) ( √(2(1 − τ) log(2/δ)) + C √(VGS log(1/δ)) ) + (1/k) ( 5 + 2 log(1/δ) + √(log(1/δ)) (C √VGS + √2) ) + { inf_{g∈GS} Ltτ(g) − L*tτ },

where C is a constant independent from n, τ and δ.

Remark 5 (ON MODEL SELECTION) Selecting an appropriate model class GS is a crucial issue in machine learning. Following in the footsteps of structured risk minimization, one may use a VC bound for E[sup_{g∈GS} |L̂k(g) − E[L̂k(g)]|] as a complexity regularization term to penalize in an additive fashion the empirical risk (9). Considering a collection of such models, oracle inequalities guaranteeing the quasi-optimality of the rule minimizing the penalized empirical risk can then be classically established by means of a slight modification of the argument of Theorem 2's proof, see e.g. Chapter 18 in [5].

The upper bound stated above shows that the learning rate is of order O_P(1/√k), where k is the actual size of the training data set used to perform approximate empirical risk minimization in the extremes.
As revealed by the corollary below, this strategy permits building a consistent sequence of classifiers for the L∞-risk, when the fraction τ = τn decays at an appropriate rate (provided that the model bias can be neglected of course).

Corollary 1 Suppose that the assumptions of Theorems 1-2 are fulfilled. In addition, assume that the model bias asymptotically vanishes as τ → 0, i.e.

inf_{g∈GS} Ltτ(g) − L*tτ → 0 as τ → 0.

Then, as soon as k → +∞ as n → ∞, the sequence of classifiers (ĝk) is consistent in the extremes, meaning that we have the convergence in probability:

L∞(ĝk) → L*∞ as n → ∞.

Algorithm 1 (ERM in the extremes)

Input Training dataset Dn = {(X1, Y1), . . . , (Xn, Yn)}, collection GS of classifiers on the sphere, size k ≤ n of the training set composed of extreme observations.

1 Standardization. Standardize the input vector by applying the rank-transformation: ∀i ∈ {1, . . . , n}, V̂i = T̂(Xi), where

T̂(x) = ( 1/(1 − F̂j(xj)) )_{j=1, ..., d},

for all x = (x1, . . . , xd) ∈ Rd.

2 Truncation. Sort the training input observations by decreasing order of magnitude, ‖V̂(1)‖ ≥ . . . ≥ ‖V̂(n)‖, and consider the set of extreme training points {(V̂(1), Y(1)), . . . , (V̂(k), Y(k))}.

3 Optimization. Compute a solution ĝk(θ) of the minimization problem

min_{g∈GS} (1/k) Σ_{i=1}^k 1{Y(i) ≠ g(Θ(V̂(i)))}.

Output The classifier ĝk(Θ(T̂(x))), applicable on the region {x : ‖T̂(x)‖ > ‖V̂(k)‖}.

Remark 6 (CHOICE OF k) Determining the best value of k is a typical challenge of Extreme Value analysis. It involves a bias/variance trade-off: too large values introduce a bias by taking into account observations which are not large enough, so that their distribution deviates significantly from the limit distribution of extremes; on the other hand, too small values obviously increase the variance of the classifier. See e.g. [6] or [7] and the references therein for a discussion. In practice a possible default choice is k = √n, otherwise cross-validation can be performed.

4 Illustrative Numerical Experiments

The purpose of our experiments is to provide insights into the performance of the classifier ĝk on extreme regions, constructed via Algorithm 1. The training set is ordered as in Step 1 of Algorithm 1. For a chosen k, let t = ‖T̂(X^train_(k))‖; the L1 norm is used throughout our experiments. The test set T is the subset of test points such that ‖T̂(X^test_i)‖ > t. To approximate the asymptotic risk in the extremes L∞(ĝk) and illustrate the generalization ability of the proposed classifier in the extreme region, we consider decreasing subsets of T. Namely, denoting ntest = |T|, we keep only the ⌊κ·ntest⌋ largest instances of T in terms of ‖T̂(X^test_i)‖, for decreasing values of κ ∈ (0, 1].
This experimental framework is summarized in Figure 1, where λt = ‖T̂(X^test_(⌊κ·ntest⌋))‖ ≥ t. We consider two different classification algorithms for Step 3 in Algorithm 1, namely random forest (RF) and k-nearest neighbors (k-NN), which correspond to two different classes GS of classifiers. For each class GS, the performance of ĝk (which considers only the direction Θ(T̂(x)) of both training and testing data, in other words classifies the projections of the datasets onto the unit sphere, see Figure 2) is compared with that of the classical version of the algorithm (RF or k-NN) taking as input the same training data, but without the standardization and truncation steps, nor the projection onto the unit sphere. Figures 4 and 5 summarize the results obtained using RF, respectively with a multivariate simulated dataset and with a real world dataset. The simulated dataset is generated from a logistic distribution as described in [13]. The positive and negative instances are generated using two different dependency parameters. An example of a dataset thus obtained is displayed in Figure 3.

Figure 1: Train set (dotted area) and test set (colored area).

Figure 2: Colored cones correspond to a given label from the classifier on the simplex.

We report the results obtained with 5·10^3 points per label for the train set and 5·10^4 points per label for the test set, with k = 100 and κ ∈ [0.3, 1]. The number of trees for both random forests (in the regular setting and in the setting of Algorithm 1) is set to 200.
The number of neighbors for both $k$-NN classifiers is set to 5.

Figure 3: Toy dataset generated from a multivariate logistic distribution, projected on $\mathbb{R}^2$.

The real dataset, known as the Ecoli dataset and introduced in [9], deals with protein localization and contains 336 instances and 8 features. The Supplementary Material gathers additional details concerning the datasets and the tuning of RF and $k$-NN in our experiments, as well as additional results obtained with the above-described datasets and with a simulated dataset from a different distribution.

Figure 4: Logistic data - test loss of RF on the simplex and regular RF, depending on the multiplicative factor $\kappa$.

Figure 5: Real data - test loss of RF on the simplex and regular RF, depending on the multiplicative factor $\kappa$.

Figure 4 shows the evolution of the Hamming losses for decreasing values of $\kappa \in [0.3, 1]$. The boxplots display the losses obtained with 10 independently simulated datasets. For the experiment on the Ecoli dataset (Figure 5), one third of the dataset is used as a test set and the rest corresponds to the train set, with $k = 100$ and $\kappa \in [0.3, 1]$ (considering smaller values of $\kappa$ was prevented by data scarcity). The boxplots display the results for different (random) partitions of the data into a train and a test set.
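The evaluation protocol above, recomputing the Hamming (0-1) loss on the $\lfloor \kappa\, n_{\mathrm{test}} \rfloor$ most extreme test points for decreasing $\kappa$, can be sketched as follows; the helper and its names are ours, for illustration only.

```python
import numpy as np

def loss_by_kappa(test_norms, y_true, y_pred, kappas=(1.0, 0.7, 0.5, 0.3)):
    """Hamming (0-1) loss restricted to the floor(kappa * n_test) test points
    with the largest standardized norms, for each kappa in `kappas`."""
    order = np.argsort(test_norms)[::-1]  # most extreme test points first
    losses = {}
    for kappa in kappas:
        m = int(np.floor(kappa * len(order)))
        sel = order[:m]
        losses[kappa] = float(np.mean(y_true[sel] != y_pred[sel]))
    return losses
```

A classifier with good extrapolation ability should keep (or decrease) its loss as $\kappa$ shrinks; Figures 4 and 5 report exactly this quantity for the two competing classifiers.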
In both examples, the loss of the regular classifier is worse (and even increases) as $\kappa$ decreases, whereas the classifier resulting from the proposed approach performs better and shows better extrapolation ability.

5 Conclusion

In various applications (e.g. safety/security, finance, insurance, environmental sciences), it is of prime importance to predict the response $Y$ of a system when it is impacted by shocks, corresponding to extremely large input values $X$. In this paper, we have developed a rigorous probabilistic framework for binary classification in extreme regions, relying on the (nonparametric) theory of regularly varying random vectors, and proved the accuracy of the ERM approach in this context, when the risk functional is computed from extreme observations only. The present contribution may open a new line of research, insofar as progress can be naturally expected in the design of algorithmic learning methods tailored to extreme points (or their projection onto the unit sphere), while statistical issues, such as estimation of the minimum risk in the extremes $L^*_\infty$, remain to be addressed.

References

[1] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. Ann. Statist., 43(6):2507–2536, 2015.

[2] J.J. Cai, J.H.J. Einmahl, and L. de Haan. Estimation of extreme risk regions under multivariate regular variation.
The Annals of Statistics, pages 1803–1826, 2011.

[3] A. Carpentier and M. Valko. Extreme bandits. In Advances in Neural Information Processing Systems 27, pages 1089–1097. Curran Associates, Inc., 2014.

[4] L. de Haan and S. Resnick. On regular variation of probability densities. Stochastic Processes and their Applications, 25:83–93, 1987.

[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer, 1996.

[6] N. Goix, A. Sabourin, and S. Clémençon. Sparse representation of multivariate extremes with applications to anomaly ranking. In Artificial Intelligence and Statistics, pages 75–83, 2016.

[7] N. Goix, A. Sabourin, and S. Clémençon. Sparse representation of multivariate extremes with applications to anomaly detection. Journal of Multivariate Analysis, 161:12–31, 2017.

[8] S. Mendelson. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1):459–502, 2018.

[9] K. Nakai and M. Kanehisa. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14(4):897–911, 1992.

[10] M.I. Ohannessian and M.A. Dahleh. Rare probability estimation under regularly varying heavy tails. In Conference on Learning Theory, 2012.

[11] S. Resnick. Extreme Values, Regular Variation, and Point Processes. Springer Series in Operations Research and Financial Engineering, 1987.

[12] T. Roos, P. Grünwald, P. Myllymäki, and H. Tirri. Generalization to unseen cases. In Y. Weiss, B. Schölkopf, and J.C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1129–1136. MIT Press, 2006.

[13] A. Stephenson.
Simulating multivariate extreme value distributions of logistic type. Extremes, 6(1):49–59, 2003.