{"title": "Variable margin losses for classifier design", "book": "Advances in Neural Information Processing Systems", "page_first": 1576, "page_last": 1584, "abstract": "The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter.", "full_text": "Variable margin losses for classi\ufb01er design\n\nHamed Masnadi-Shirazi\n\nStatistical Visual Computing Laboratory,\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92039\n\nhmasnadi@ucsd.edu\n\nAbstract\n\nNuno Vasconcelos\n\nStatistical Visual Computing Laboratory,\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92039\nnuno@ucsd.edu\n\nThe problem of controlling the margin of a classi\ufb01er is studied. 
A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter.

1 Introduction

Optimal classifiers minimize the expected value of a loss function, or risk. Losses commonly used in machine learning are upper-bounds on the zero-one classification loss of classical Bayes decision theory. When the resulting classifier converges asymptotically to the Bayes decision rule, as training samples increase, the loss is said to be Bayes consistent. Examples of such losses include the hinge loss, used in SVM design, the exponential loss, used by boosting algorithms such as AdaBoost, or the logistic loss, used in both classical logistic regression and more recent methods, such as LogitBoost. Unlike the zero-one loss, these losses assign a penalty to examples correctly classified but close to the boundary. 
This guarantees a classification margin, and improved generalization when learning from finite datasets [1]. Although the connections between large-margin classification and classical decision theory have been known since [2], the set of Bayes consistent large-margin losses has remained small. Most recently, the design of such losses has been studied in [3]. By establishing connections to the classical literature in probability elicitation [4], this work introduced a generic framework for the derivation of Bayes consistent losses. The main idea is that there are three quantities that matter in risk minimization: the loss function φ, a corresponding optimal link function f*_φ, which maps posterior class probabilities to classifier predictions, and a minimum risk C*_φ, associated with the optimal link.

While the standard approach to classifier design is to define a loss φ, and then optimize it to obtain f*_φ and C*_φ, [3] showed that there is an alternative: to specify f*_φ and C*_φ, and analytically derive the loss φ. The advantage is that this makes it possible to manipulate the properties of the loss, while guaranteeing that it is Bayes consistent. The practical relevance of this approach is illustrated in [3], where a Bayes consistent robust loss is derived, for application in problems involving outliers. This loss is then used to design a robust boosting algorithm, denoted SavageBoost. SavageBoost has been, more recently, shown to outperform most other boosting algorithms in computer vision problems, where outliers are prevalent [5]. The main limitation of the framework of [3] is that it is not totally constructive. It turns out that many pairs (C*_φ, f*_φ) are compatible with any Bayes consistent loss φ. Furthermore, while there is a closed form relationship between φ and (C*_φ, f*_φ), this relationship is far from simple. This makes it difficult to understand how the properties of the loss are influenced by the properties of either C*_φ or f*_φ. In practice, the design has to resort to trial and error, by 1) testing combinations of the latter and, 2) verifying whether the loss has the desired properties. This is feasible when the goal is to enforce a broad loss property, e.g. that a robust loss should be bounded for negative margins [3], but impractical when the goal is to exercise a finer degree of control.

In this work, we consider one such problem: how to control the size of the margin enforced by the loss. We start by showing that, while many pairs (C*_φ, f*_φ) are compatible with a given φ, one of these pairs establishes a very tight connection between the optimal link and the minimum risk: that f*_φ is the derivative of C*_φ. We refer to the risk function associated with such a pair as a canonical risk, and show that it leads to an equally tight connection between the pair (C*_φ, f*_φ) and the loss φ. For a canonical risk, all three functions can be obtained from each other with one-to-one mappings of trivial analytical tractability. This enables a detailed analytical study of how C*_φ and f*_φ affect φ. We consider the case where the inverse of f*_φ is a sigmoidal function, i.e. f*_φ is inverse-sigmoidal, and show that this strongly constrains the loss. Namely, the latter becomes 1) convex, 2) monotonically decreasing, 3) linear for large negative margins, and 4) constant for large positive margins. 
This implies that, for a canonical risk, the choice of a particular link in the inverse-sigmoidal family only impacts the behavior of φ around the origin, i.e. the size of the margin enforced by the loss. This quantity is then shown to depend only on the slope of the sigmoidal inverse-link at the origin. Since this property can be controlled by a single parameter, the latter becomes a margin-tuning parameter, i.e. a parameter that determines the margin of the optimal classifier. This is exploited to design parametric families of loss functions that allow explicit control of the classification margin. These losses are applied to the design of novel boosting algorithms of tunable margin. Finally, it is shown that the requirements of 1) a canonical risk, and 2) an inverse-sigmoidal link are not unduly restrictive for classifier design. In fact, approaches like logistic regression or LogitBoost are special cases of the proposed framework. A number of experiments are conducted to study the effect of margin-control on the classification accuracy. It is shown that the proposed variable-margin losses outperform the fixed-margin counterparts used by existing algorithms. Finally, it is shown that cross-validation of the margin parameter leads to classifiers with the best performance on all datasets tested.

2 Loss functions for classification

We start by briefly reviewing the theory of Bayes consistent classifier design. See [2, 6, 7, 3] for further details. A classifier h maps a feature vector x ∈ X to a class label y ∈ {−1, 1}. This mapping can be written as h(x) = sign[p(x)] for some function p : X → R, which is denoted as the classifier predictor. Feature vectors and class labels are drawn from probability distributions P_X(x) and P_Y(y), respectively. Given a non-negative loss function L(x, y), the classifier is optimal if it minimizes the risk R(f) = E_{X,Y}[L(h(x), y)]. 
This is equivalent to minimizing the conditional risk E_{Y|X}[L(h(x), y)|X = x] for all x ∈ X. It is useful to express p(x) as a composition of two functions, p(x) = f(η(x)), where η(x) = P_{Y|X}(1|x), and f : [0, 1] → R is a link function. Classifiers are frequently designed to be optimal with respect to the zero-one loss

L_{0/1}(f, y) = (1 − sign(yf))/2 = { 0, if y = sign(f); 1, if y ≠ sign(f) },   (1)

where we omit the dependence on x for notational simplicity. The associated conditional risk is

C_{0/1}(η, f) = η (1 − sign(f))/2 + (1 − η)(1 + sign(f))/2 = { 1 − η, if f ≥ 0; η, if f < 0 }.   (2)

The risk is minimized if

f(x) > 0 if η(x) > 1/2;  f(x) = 0 if η(x) = 1/2;  f(x) < 0 if η(x) < 1/2.   (3)

Table 1: Loss φ, optimal link f*_φ(η), optimal inverse link [f*_φ]^{-1}(v), and minimum conditional risk C*_φ(η) for popular learning algorithms.

Algorithm           | φ(v)            | f*_φ(η)             | [f*_φ]^{-1}(v)      | C*_φ(η)
SVM                 | max(1 − v, 0)   | sign(2η − 1)        | NA                  | 1 − |2η − 1|
Boosting            | exp(−v)         | (1/2) log(η/(1−η))  | e^{2v}/(1 + e^{2v}) | 2√(η(1 − η))
Logistic Regression | log(1 + e^{−v}) | log(η/(1−η))        | e^{v}/(1 + e^{v})   | −η log η − (1 − η) log(1 − η)

Examples of optimal link functions include f* = 2η − 1 and f* = log(η/(1−η)). 
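As a quick numerical sanity check of Table 1 (an illustrative sketch, not part of the paper), minimizing the conditional risk η φ(f) + (1 − η) φ(−f) on a grid of f values recovers the tabulated optimal links and minimum risks for the exponential and logistic losses:

```python
import math

def conditional_risk(phi, eta, f):
    # C_phi(eta, f) = eta * phi(f) + (1 - eta) * phi(-f)
    return eta * phi(f) + (1 - eta) * phi(-f)

def grid_minimize(phi, eta, lo=-10.0, hi=10.0, steps=20001):
    # brute-force minimizer of f -> C_phi(eta, f) on a uniform grid
    best_f, best_c = lo, float("inf")
    for i in range(steps):
        f = lo + (hi - lo) * i / (steps - 1)
        c = conditional_risk(phi, eta, f)
        if c < best_c:
            best_f, best_c = f, c
    return best_f, best_c

exp_loss = lambda v: math.exp(-v)                 # boosting loss
log_loss = lambda v: math.log(1.0 + math.exp(-v)) # logistic loss

eta = 0.8
f_exp, c_exp = grid_minimize(exp_loss, eta)
f_log, c_log = grid_minimize(log_loss, eta)

# Table 1 predicts the optimal link and minimum conditional risk:
assert abs(f_exp - 0.5 * math.log(eta / (1 - eta))) < 1e-3
assert abs(c_exp - 2 * math.sqrt(eta * (1 - eta))) < 1e-6
assert abs(f_log - math.log(eta / (1 - eta))) < 1e-3
assert abs(c_log - (-eta * math.log(eta) - (1 - eta) * math.log(1 - eta))) < 1e-6
```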
The associated optimal classifier h* = sign[f*] is the well known Bayes decision rule (BDR), and the associated minimum conditional (zero-one) risk is

C*_{0/1}(η) = η (1/2 − (1/2) sign(2η − 1)) + (1 − η)(1/2 + (1/2) sign(2η − 1)).   (4)

A loss which is minimized by the BDR is Bayes consistent. A number of Bayes consistent alternatives to the 0-1 loss are commonly used. These include the exponential loss of boosting, the log loss of logistic regression, and the hinge loss of SVMs. They have the form L_φ(f, y) = φ(yf), for different functions φ. These functions assign a non-zero penalty to small positive yf, encouraging the creation of a margin, a property not shared by the 0-1 loss. The resulting large-margin classifiers have better generalization than those produced by the latter [1]. The associated conditional risk

C_φ(η, f) = η φ(f) + (1 − η) φ(−f)   (5)

is minimized by the link

f*_φ(η) = arg min_f C_φ(η, f),   (6)

leading to the minimum conditional risk function C*_φ(η) = C_φ(η, f*_φ). Table 1 lists the loss, optimal link, and minimum risk of some of the most popular classifier design methods.

Conditional risk minimization is closely related to classical probability elicitation in statistics [4]. Here, the goal is to find the probability estimator η̂ that maximizes the expected reward

I(η, η̂) = η I_1(η̂) + (1 − η) I_{−1}(η̂),   (7)

where I_1(η̂) is the reward for prediction η̂ when event y = 1 holds and I_{−1}(η̂) the corresponding reward when y = −1. 
The functions I_1(·), I_{−1}(·) should be such that the expected reward is maximal when η̂ = η, i.e.

I(η, η̂) ≤ I(η, η) = J(η), ∀η   (8)

with equality if and only if η̂ = η. The conditions under which this holds are as follows.

Theorem 1. [4] Let I(η, η̂) and J(η) be as defined in (7) and (8). Then 1) J(η) is convex and 2) (8) holds if and only if

I_1(η) = J(η) + (1 − η)J′(η)   (9)
I_{−1}(η) = J(η) − ηJ′(η).   (10)

Hence, starting from any convex J(η), it is possible to derive I_1(·), I_{−1}(·) so that (8) holds. This enables the following connection to risk minimization.

Theorem 2. [3] Let J(η) be as defined in (8) and f a continuous function. If the following properties hold

1. J(η) = J(1 − η),
2. f is invertible with symmetry

f^{-1}(−v) = 1 − f^{-1}(v),   (11)

then the functions I_1(·) and I_{−1}(·) derived with (9) and (10) satisfy the following equalities

I_1(η) = −φ(f(η))   (12)
I_{−1}(η) = −φ(−f(η)),   (13)

with

φ(v) = −J[f^{-1}(v)] − (1 − f^{-1}(v)) J′[f^{-1}(v)].   (14)

Under the conditions of the theorem, I(η, η̂) = −C_φ(η, f). This establishes a new path for classifier design [3]. Rather than specifying a loss φ and minimizing C_φ(η, f), so as to obtain whatever optimal link f*_φ(η) results, it is possible to specify f*_φ and minimum expected risk C*_φ(η) and derive, from (14) with J(η) = −C*_φ(η), the underlying loss φ. The main advantage is the ability to control directly the quantities that matter for classification, namely the predictor and risk of the optimal classifier. The only conditions are that C*_φ(η) = C*_φ(1 − η) and (11) holds for f*_φ.

3 Canonical risk minimization

In general, given J(η) = −C*_φ(η), there are multiple pairs (φ, f*_φ) that satisfy (14). Hence, specification of either the minimum risk or optimal link does not completely characterize the loss. This makes it difficult to control some important properties of the latter, such as the margin. In this work, we consider an important special case, where such control is possible. We start with a lemma that relates the symmetry conditions, on J(η) and f*_φ(η), of Theorem 2.

Lemma 3. Let J(η) be a strictly convex and differentiable function such that J(η) = J(1 − η). Then J′(η) is invertible and

[J′]^{-1}(−v) = 1 − [J′]^{-1}(v).   (15)

Hence, under the conditions of Theorem 2, the derivative of J(η) has the same symmetry as f*_φ. Since this symmetry is the only constraint on f*_φ, the former can be used as the latter. Whenever this holds, the risk is said to be in canonical form, and (f*, J) are denoted a canonical pair [6].

Definition 1. Let J(η) be as defined in (8), and C*_φ(η) = −J(η) a minimum risk. If the optimal link associated with C*_φ(η) is

f*_φ(η) = J′(η),   (16)

the risk C_φ(η, f) is said to be in canonical form. f*_φ(η) is denoted a canonical link and φ(v), the loss given by (14), a canonical loss.

Note that (16) does not hold for all risks. For example, the risk of boosting is derived from the convex, differentiable, and symmetric J(η) = −2√(η(1 − η)). Since this has derivative

J′(η) = (2η − 1)/√(η(1 − η)) ≠ (1/2) log(η/(1 − η)) = f*_φ(η),   (17)

the risk is not in canonical form. What follows from (16) is that it is possible to derive a canonical risk for any maximal reward J(η), including that of boosting (J(η) = −2√(η(1 − η))). This is discussed in detail in Section 5.

While canonical risks can be easily designed by specifying either J(η) or f*_φ(η), and then using (14) and (16), it is much less clear how to directly specify a loss φ(v) for which (14) holds with a canonical pair (f*, J). The following result solves this problem.

Theorem 4. Let C_φ(η, f) be the canonical risk derived from a convex and symmetric J(η). Then

φ′(v) = −[J′]^{-1}(−v) = [f*_φ]^{-1}(v) − 1.   (18)

Figure 1: Left: canonical losses compatible with an IS optimal link. Right: Average classification rank as a function of margin parameter, on the UCI data.

This theorem has various interesting consequences. First, it establishes an easy-to-verify necessary condition for the canonical form. For example, logistic regression has [f*_φ]^{-1}(v) = 1/(1 + e^{−v}) and φ′(v) = −e^{−v}/(1 + e^{−v}) = [f*_φ]^{-1}(v) − 1, while boosting has [f*_φ]^{-1}(v) = 1/(1 + e^{−2v}) and φ′(v) = −e^{−v} ≠ [f*_φ]^{-1}(v) − 1. This (plus the symmetry of J and f*_φ) shows that the former is in canonical form but the latter is not. Second, it makes it clear that, up to additive constants, the three components (φ, C*_φ, f*_φ) of a canonical risk are related by one-to-one relationships. Hence, it is possible to control the properties of the three components of the risk by manipulating a single function (which can be any of the three). 
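These relationships are easy to check numerically. The sketch below (illustrative, not from the paper) recovers the logistic loss from its canonical pair via (14), and verifies that the necessary condition (18) holds for logistic regression but fails for the boosting loss:

```python
import math

# Canonical pair for logistic regression:
# J(eta) = eta log eta + (1 - eta) log(1 - eta), so C* = -J and J'(eta) = log(eta/(1-eta)) = f*(eta)
J = lambda eta: eta * math.log(eta) + (1 - eta) * math.log(1 - eta)
Jp = lambda eta: math.log(eta / (1 - eta))
f_inv = lambda v: 1.0 / (1.0 + math.exp(-v))  # sigmoid, the inverse link [f*]^{-1}

def phi_from_pair(v):
    # phi(v) = -J[f^{-1}(v)] - (1 - f^{-1}(v)) J'[f^{-1}(v)], equation (14)
    eta = f_inv(v)
    return -J(eta) - (1 - eta) * Jp(eta)

# (14) recovers the logistic loss log(1 + e^{-v})
for v in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    assert abs(phi_from_pair(v) - math.log(1 + math.exp(-v))) < 1e-9

# Theorem 4 condition phi'(v) = [f*]^{-1}(v) - 1, checked with a central difference
def num_deriv(g, v, h=1e-6):
    return (g(v + h) - g(v - h)) / (2 * h)

log_phi = lambda v: math.log(1 + math.exp(-v))
exp_phi = lambda v: math.exp(-v)                      # boosting loss
exp_f_inv = lambda v: 1.0 / (1.0 + math.exp(-2 * v))  # boosting inverse link

for v in [-2.0, 0.0, 1.5]:
    assert abs(num_deriv(log_phi, v) - (f_inv(v) - 1)) < 1e-6    # canonical: holds
assert abs(num_deriv(exp_phi, 1.0) - (exp_f_inv(1.0) - 1)) > 0.1 # boosting: fails
```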
Finally, it enables a very detailed characterization of the losses compatible with most optimal links of Table 1.

4 Inverse-sigmoidal links

Inspection of Table 1 suggests that the classifiers produced by boosting, logistic regression, and variants have sigmoidal inverse links [f*_φ]^{-1}. Due to this, we refer to the links f*_φ as inverse-sigmoidal (IS). When this is the case, (18) provides a very detailed characterization of the loss φ. In particular, letting f^{(n)} denote the nth-order derivative of f, it can be trivially shown that the following hold

lim_{v→−∞} [f*_φ]^{-1}(v) = 0 ⇔ lim_{v→−∞} φ^{(1)}(v) = −1   (19)
lim_{v→∞} [f*_φ]^{-1}(v) = 1 ⇔ lim_{v→∞} φ^{(1)}(v) = 0   (20)
lim_{v→±∞} ([f*_φ]^{-1})^{(n)}(v) = 0, n ≥ 1 ⇔ lim_{v→±∞} φ^{(n+1)}(v) = 0, n ≥ 1   (21)
[f*_φ]^{-1}(v) ∈ (0, 1) ⇔ φ(v) monotonically decreasing   (22)
[f*_φ]^{-1}(v) monotonically increasing ⇔ φ(v) convex   (23)
[f*_φ]^{-1}(0) = 0.5 ⇔ φ^{(1)}(0) = −0.5.   (24)

It follows that, as illustrated in Figure 1, the loss φ(v) is convex, monotonically decreasing, linear (with slope −1) for large negative v, constant for large positive v, and has slope −0.5 at the origin. The set of losses compatible with an IS link is, thus, strongly constrained. The only degrees of freedom are in the behavior of the function around the origin. 
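For the logistic pair, properties (19)-(24) can be verified directly (an illustrative check, not from the paper; the sigmoid plays the role of the IS inverse link and log(1 + e^{-v}) is its canonical loss):

```python
import math

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))  # an IS inverse link
phi = lambda v: math.log(1 + math.exp(-v))      # its canonical loss

d = lambda g, v, h=1e-5: (g(v + h) - g(v - h)) / (2 * h)  # central difference

# (19)/(20): slope -1 for large negative v, slope 0 for large positive v
assert abs(d(phi, -20.0) + 1.0) < 1e-6
assert abs(d(phi, 20.0)) < 1e-6
# (22): phi monotonically decreasing, since sigmoid(v) is in (0, 1)
vs = [x / 2.0 for x in range(-10, 11)]
assert all(phi(a) > phi(b) for a, b in zip(vs, vs[1:]))
# (23): phi convex, since the sigmoid is increasing (positive second difference)
assert all(phi(v - 0.5) + phi(v + 0.5) - 2 * phi(v) > 0 for v in vs)
# (24): slope -0.5 at the origin, since sigmoid(0) = 0.5
assert abs(d(phi, 0.0) + 0.5) < 1e-6
```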
This is not surprising, since the only degrees of freedom of the sigmoid itself are in its behavior within this region.

Figure 2: canonical link (left) and loss (right) for various values of a. (Top) logistic, (bottom) boosting.

What is interesting is that these are the degrees of freedom that control the margin characteristics of the loss φ. Hence, by controlling the behavior of the IS link around the origin, it is possible to control the margin of the optimal classifier. In particular, the margin is a decreasing function of the curvature of the loss at the origin, φ^{(2)}(0). Since, from (18), φ^{(2)}(0) = ([f*_φ]^{-1})^{(1)}(0), the margin can be controlled by varying the slope of [f*_φ]^{-1} at the origin.

5 Variable margin loss functions

The results above enable the derivation of families of canonical losses with controllable margin. In Section 3, we have seen that the boosting loss is not canonical, but there is a canonical loss for the minimum risk of boosting. 
We consider a parametric extension of this risk,

J(η; a) = −(2/a) √(η(1 − η)),  a > 0.   (25)

From (16), the canonical optimal link is

f*_φ(η; a) = (2η − 1) / (a √(η(1 − η)))   (26)

and it can be shown that

[f*_φ]^{-1}(v; a) = 1/2 + av / (2√(4 + (av)^2))   (27)

is an IS link, i.e. satisfies (19)-(24). Using (18), the corresponding canonical loss is

φ(v; a) = (1/(2a)) (√(4 + (av)^2) − av).   (28)

Because it shares the minimum risk of boosting, we refer to this loss as the canonical boosting loss. It is plotted in Figure 2, along with the inverse link, for various values of a. Note that the inverse link is indeed sigmoidal, and that the margin is determined by a. Since φ^{(2)}(0; a) = a/4, the margin increases with decreasing a.

Table 2: Margin parameter value a of rank 1 for each of the ten UCI datasets.

UCI dataset#    | #1  | #2  | #3  | #4  | #5  | #6 | #7  | #8  | #9  | #10
Canonical Log   | 0.4 | 0.5 | 0.6 | 0.3 | 0.1 | 2  | 0.5 | 0.1 | 0.2 | 0.2
Canonical Boost | 0.9 | 6   | 2   | 2   | 0.4 | 3  | 0.2 | 4   | 0.2 | 0.9

It is also possible to derive variable margin extensions of existing canonical losses. For example, consider the parametric extension of the minimum risk of logistic regression

J(η; a) = (1/a) η log(η) + (1/a)(1 − η) log(1 − η).   (29)

From (16),

f*_φ(η; a) = (1/a) log(η/(1 − η))   (30)

[f*_φ]^{-1}(v; a) = e^{av} / (1 + e^{av}).

This is again a sigmoidal inverse link and, from (18),

φ(v; a) = (1/a) [log(1 + e^{av}) − av].   (31)

We denote this loss the canonical logistic loss. It is plotted in Figure 2, along with the corresponding inverse link for various a. Since φ^{(2)}(0; a) = a/4, the margin again increases with decreasing a.

Note that, in (28) and (31), margin control is not achieved by simply rescaling the domain of the loss function, e.g. just replacing log(1 + e^{−v}) by log(1 + e^{−av}) in the case of logistic regression. This would have no impact in classification accuracy, since it would just amount to a change of scale of the original feature space. While this type of re-scaling occurs in both families of loss functions above (which are both functions of av), it is localized around the origin, and only influences the margin properties of the loss. As can be seen in Figure 2, all loss functions are identical away from the origin. Hence, varying a is conceptually similar to varying the bandwidth of an SVM kernel. This suggests that the margin parameter a could be cross-validated to achieve best performance.

6 Experiments

A number of easily reproducible experiments were conducted to study the effect of variable margin losses on the accuracy of the resulting classifiers. Ten binary UCI data sets were considered: (#1) sonar, (#2) breast cancer prognostic, (#3) breast cancer diagnostic, (#4) original Wisconsin breast cancer, (#5) Cleveland heart disease, (#6) tic-tac-toe, (#7) echo-cardiogram, (#8) Haberman's survival, (#9) Pima-diabetes and (#10) liver disorder. The data was split into five folds, four used for training and one for testing. This produced five training-test pairs per dataset. The GradientBoost algorithm [8], with histogram-based weak learners, was then used to design boosted classifiers which minimize the canonical logistic and boosting losses, for various margin parameters. GradientBoost was adopted because it can be easily combined with the different losses, guaranteeing that, other than the loss, every aspect of classifier design is constant. This makes the comparison as fair as possible. 
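The pieces above can be sketched in a few lines of code. The loss definitions follow (28) and (31); the boosting loop is only a minimal stand-in for GradientBoost, with least-squares threshold stumps instead of the paper's histogram-based weak learners, and the number of rounds and shrinkage are illustrative choices, not the paper's settings:

```python
import numpy as np

def canonical_boosting_loss(v, a):
    # phi(v; a) = (1/(2a)) (sqrt(4 + (a v)^2) - a v), equation (28)
    return (np.sqrt(4 + (a * v) ** 2) - a * v) / (2 * a)

def canonical_logistic_loss(v, a):
    # phi(v; a) = (1/a) [log(1 + e^{a v}) - a v], equation (31)
    return (np.logaddexp(0, a * v) - a * v) / a

def canonical_logistic_grad(v, a):
    # phi'(v; a) = sigmoid(a v) - 1, from (18) and the inverse link e^{av}/(1 + e^{av})
    return 1.0 / (1.0 + np.exp(-a * v)) - 1.0

# curvature at the origin is a/4 for both families (second-difference check)
for a in (0.5, 1.0, 4.0):
    for loss in (canonical_boosting_loss, canonical_logistic_loss):
        h = 1e-4
        curv = (loss(h, a) - 2 * loss(0.0, a) + loss(-h, a)) / h ** 2
        assert abs(curv - a / 4) < 1e-3

def fit_stump(x, r):
    # least-squares regression stump on 1-D inputs x for targets r
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        cl = left.mean() if left.size else 0.0
        cr = right.mean() if right.size else 0.0
        err = ((np.where(x <= t, cl, cr) - r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, cl, cr)
    _, t, cl, cr = best
    return lambda z: np.where(z <= t, cl, cr)

def gradient_boost(x, y, a, rounds=20, lr=0.5):
    # minimize sum_i phi(y_i f(x_i); a) by fitting stumps to pseudo-residuals
    f, stumps = np.zeros_like(x, dtype=float), []
    for _ in range(rounds):
        r = -y * canonical_logistic_grad(y * f, a)  # negative functional gradient
        s = fit_stump(x, r)
        stumps.append(s)
        f = f + lr * s(x)
    return lambda z: sum(lr * s(z) for s in stumps)

# toy 1-D problem: the label is the sign of x
x = np.linspace(-1, 1, 40)
y = np.sign(x)
F = gradient_boost(x, y, a=0.5)
assert np.all(np.sign(F(x)) == y)
```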
50 boosting iterations were applied to each training set, for 19 values of a ∈ {0.1, 0.2, ..., 0.9, 1, 2, ..., 10}. The classification accuracy was then computed per dataset, by averaging over its five train/test pairs.

Since existing algorithms can be seen as derived from special cases of the proposed losses, with a = 1, it is natural to inquire whether other values of the margin parameter will achieve best performance. To explore this question we show, in Figure 1, the average rank of the classifier designed with each loss and margin parameter a. To produce the plot, a classifier was trained on each dataset, for all 19 values of a. The results were then ranked, with rank 1 (19) being assigned to the a parameter of smallest (largest) error. The ranks achieved with each a were then averaged over the ten datasets, as suggested in [9]. For the canonical logistic loss, the best value of a is in the range 0.2 ≤ a ≤ 0.3. Note that the average rank for this range (between 5 and 6) is better than that (close to 7) achieved with the logistic loss of LogitBoost [2] (a = 1). 
In fact, as can be seen from Table 2, the canonical logistic loss with a = 1 did not achieve rank 1 on any dataset, whereas canonical logistic losses with 0.2 ≤ a ≤ 0.3 were top ranked on 3 datasets (and with 0.1 ≤ a ≤ 0.4 on 6). For the canonical boosting loss, there is also a range (0.8 ≤ a ≤ 2) that produces best results. We note that the a values of the two losses are not directly comparable. This can be seen from Figure 2, where a = 0.4 produces a loss of much larger margin for canonical boosting. Furthermore, the canonical boosting loss has a heavier tail and approaches zero more slowly than the canonical logistic loss.

Table 3: Classification error for each loss function and UCI dataset.

UCI dataset#           | #1   | #2   | #3  | #4   | #5   | #6   | #7  | #8   | #9   | #10
Canonical Log          | 11.2 | 11.4 | 8   | 5.6  | 12.4 | 11.8 | 7   | 18.8 | 38.2 | 27
LogitBoost (a = 1)     | 11.6 | 12.4 | 10  | 6.6  | 13.4 | 48.6 | 6.8 | 21.2 | 39.6 | 28.4
Canonical Boost        | 12.6 | 11.6 | 21  | 18.6 | 17.6 | 7.2  | 6   | 21.8 | 37.6 | 28.6
Canonical Boost, a = 1 | 13.2 | 12.4 | 21  | 18.6 | 18.6 | 50.8 | 7.2 | 21.2 | 39.4 | 28.2
AdaBoost               | 11.4 | 11.4 | 9.4 | 6.4  | 14   | 28   | 6.6 | 21.8 | 41.2 | 28.2

Table 4: Classification error for each loss function and UCI dataset.

UCI dataset#             | #1   | #2   | #3   | #4   | #5   | #6   | #7  | #8   | #9   | #10
Canonical Log, a = 0.2   | 13.2 | 15   | 8.4  | 5    | 11.2 | 56.2 | 6.8 | 24   | 39.8 | 25.8
Canonical Boost, a = 0.2 | 12.6 | 14.8 | 17.2 | 18.6 | 12   | 56.8 | 6.8 | 23.2 | 38.4 | 26.4
LogitBoost (a = 1)       | 12.4 | 15.4 | 8.6  | 5.6  | 11.4 | 46   | 7.2 | 25   | 40.4 | 26.4
AdaBoost                 | 11.4 | 15.2 | 9.2  | 6    | 11.4 | 21.6 | 7.4 | 23.2 | 42.8 | 26.6

Although certain ranges of margin parameters seem to produce best results for both canonical loss functions, the optimal parameter value is likely to be dataset dependent. This is confirmed by Table 2, which presents the parameter value of rank 1 for each of the ten datasets. 
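The average-rank protocol of [9] used above can be sketched as follows (a simplified version that ignores ties, whereas [9] assigns tied configurations their average rank; the error matrix is a made-up toy example):

```python
import numpy as np

def average_ranks(errors):
    # errors: (n_datasets, n_configs); rank 1 = smallest error within each dataset.
    # argsort-of-argsort yields per-row ranks; assumes no ties within a row.
    ranks = np.argsort(np.argsort(errors, axis=1), axis=1) + 1
    return ranks.mean(axis=0)

# toy example: 3 datasets, 4 margin settings
errors = np.array([
    [0.10, 0.12, 0.11, 0.20],
    [0.30, 0.25, 0.27, 0.40],
    [0.15, 0.14, 0.16, 0.18],
])
avg = average_ranks(errors)  # the setting with the lowest average rank wins
```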
Improved performance should thus be possible by cross-validating the margin parameter a. Table 3 presents the 5-fold cross validation test error (number of misclassified points) obtained for each UCI dataset and canonical loss. The table also shows the results of AdaBoost, LogitBoost (canonical logistic, a = 1), and the canonical boosting loss with a = 1. Cross validating the margin results in better performance for 9 out of 10 (8 out of 10) datasets for the canonical logistic (boosting) loss, when compared to the fixed margin (a = 1) counterparts. When compared to the existing algorithms, at least one of the margin-tuned classifiers is better than both Logit and AdaBoost for each dataset.

Under certain experimental conditions, cross validation might not be possible or computationally feasible. Even in this case, it may be better to use a value of a other than the standard a = 1. Table 4 presents results for the case where the margin parameter is fixed at a = 0.2 for both canonical loss functions. In this case, canonical logistic and canonical boosting outperform both LogitBoost and AdaBoost in 7 and 5 of the ten datasets, respectively. The converse, i.e. LogitBoost and AdaBoost outperforming both canonical losses, only happens in 2 and 3 datasets, respectively.

7 Conclusion

The probability elicitation approach to loss function design, introduced in [3], enables the derivation of new Bayes consistent loss functions. Yet, because the procedure is not fully constructive, this requires trial and error. In general, it is difficult to anticipate the properties, and shape, of a loss function that results from combining a certain minimal risk with a certain link function. In this work, we have addressed this problem for the class of canonical risks. We have shown that the associated canonical loss functions lend themselves to analysis, due to a simple connection between the associated minimum conditional risk and optimal link functions. 
This analysis was shown to enable a precise characterization of 1) the relationships between loss, optimal link, and minimum risk, and 2) the properties of the loss whenever the optimal link is in the family of inverse sigmoid functions. These properties were then exploited to design parametric families of loss functions with explicit margin control. Experiments with boosting algorithms derived from these variable margin losses have shown better performance than those of classical algorithms, such as AdaBoost or LogitBoost.

References

[1] V. N. Vapnik, Statistical Learning Theory. John Wiley & Sons Inc, 1998.
[2] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, 2000.
[3] H. Masnadi-Shirazi and N. Vasconcelos, “On the design of loss functions for classification: theory, robustness to outliers, and savageboost,” in NIPS, 2008, pp. 1049–1056.
[4] L. J. Savage, “The elicitation of personal probabilities and expectations,” Journal of the American Statistical Association, vol. 66, pp. 783–801, 1971.
[5] C. Leistner, A. Saffari, P. M. Roth, and H. Bischof, “On robustness of on-line boosting - a competitive study,” in IEEE ICCV Workshop on On-line Computer Vision, 2009.
[6] A. Buja, W. Stuetzle, and Y. Shen, “Loss functions for binary class probability estimation and classification: Structure and applications,” 2006.
[7] T. Zhang, “Statistical behavior and consistency of classification methods based on convex risk minimization,” Annals of Statistics, 2004.
[8] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[9] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
", "award": [], "sourceid": 421, "authors": [{"given_name": "Hamed", "family_name": "Masnadi-shirazi", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}