{"title": "Optimal Binary Classifier Aggregation for General Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 5032, "page_last": 5039, "abstract": "We address the problem of aggregating an ensemble of predictors with known loss bounds in a semi-supervised binary classification setting, to minimize prediction loss incurred on the unlabeled data. We find the minimax optimal predictions for a very general class of loss functions including all convex and many non-convex losses, extending a recent analysis of the problem for misclassification error. The result is a family of semi-supervised ensemble aggregation algorithms which are as efficient as linear learning by convex optimization, but are minimax optimal without any relaxations. Their decision rules take a form familiar in decision theory -- applying sigmoid functions to a notion of ensemble margin -- without the assumptions typically made in margin-based learning.", "full_text": "Optimal Binary Classi\ufb01er Aggregation for General\n\nLosses\n\nAkshay Balsubramani\n\nUniversity of California, San Diego\n\nabalsubr@ucsd.edu\n\nYoav Freund\n\nUniversity of California, San Diego\n\nyfreund@ucsd.edu\n\nAbstract\n\nWe address the problem of aggregating an ensemble of predictors with known loss\nbounds in a semi-supervised binary classi\ufb01cation setting, to minimize prediction\nloss incurred on the unlabeled data. We \ufb01nd the minimax optimal predictions for\na very general class of loss functions including all convex and many non-convex\nlosses, extending a recent analysis of the problem for misclassi\ufb01cation error. The\nresult is a family of semi-supervised ensemble aggregation algorithms which are\nas ef\ufb01cient as linear learning by convex optimization, but are minimax optimal\nwithout any relaxations. 
Their decision rules take a form familiar in decision\ntheory \u2013 applying sigmoid functions to a notion of ensemble margin \u2013 without the\nassumptions typically made in margin-based learning.\n\n1\n\nIntroduction\n\nConsider a binary classi\ufb01cation problem, in which we are given an ensemble of individual classi\ufb01ers\nto aggregate into the most accurate predictor possible for data falling into two classes. Our predic-\ntions are measured on a large test set of unlabeled data, on which we know the ensemble classi\ufb01ers\u2019\npredictions but not the true test labels. Without using the unlabeled data, the prototypical super-\nvised solution is empirical risk minimization (ERM): measure the errors of the ensemble classi\ufb01ers\nwith labeled data, and then simply predict according to the best classi\ufb01er. But can we learn a better\npredictor by using unlabeled data as well?\nThis problem is central to semi-supervised learning. The authors of this paper recently derived\nthe worst-case-optimal solution for it when performance is measured with classi\ufb01cation error ([1]).\nHowever, this zero-one loss is inappropriate for other common binary classi\ufb01cation tasks, such as\nestimating label probabilities, and handling false positives and false negatives differently. Such goals\nmotivate the use of different evaluation losses like log loss and cost-weighted misclassi\ufb01cation loss.\nIn this paper, we generalize the setup of [1] to these loss functions and a large class of others. Like\nthe earlier work, the choice of loss function completely speci\ufb01es the minimax optimal ensemble\naggregation algorithm in our setting, which is ef\ufb01cient and scalable.\nThe algorithm learns weights over the ensemble classi\ufb01ers by minimizing a convex function. The\noptimal prediction on each example in the test set is a sigmoid-like function of a linear combination\nof the ensemble predictions, using the learned weighting. 
Due to the minimax structure, this decision\nrule depends solely upon the loss function and upon the structure of the ensemble predictions on data,\nwith no parameter or model choices.\n\n1.1 Preliminaries\nOur setting generalizes that of [1], in which we are given an ensemble H = {h1, . . . , hp} and unla-\nbeled (test) examples x1, . . . , xn on which to predict. The ensemble\u2019s predictions on the unlabeled\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fdata are written as a matrix F:\n\nF =\n\n\uf8eb\uf8ec\uf8edh1(x1) h1(x2)\n\n...\n\n...\n\nhp(x1) hp(x2)\n\n\uf8f6\uf8f7\uf8f8\n\n\u00b7\u00b7\u00b7 h1(xn)\n...\n\u00b7\u00b7\u00b7 hp(xn)\n\n...\n\nWe use vector notation for the rows and columns of F: hi = (hi(x1),\u00b7\u00b7\u00b7 , hi(xn))(cid:62) and xj =\n(h1(xj),\u00b7\u00b7\u00b7 , hp(xj))(cid:62). Each example j \u2208 [n] has a binary label yj \u2208 {\u22121, 1}, but the test labels\nare allowed to be randomized, represented by values in [\u22121, 1] instead of just the two values {\u22121, 1};\ne.g. zi = 1\n4. So the labels on the test data can be\nrepresented by z = (z1; . . . ; zn) \u2208 [\u22121, 1]n, and are unknown to the predictor, which predicts\ng = (g1; . . . ; gn) \u2208 [\u22121, 1]n.\n\n2 indicates yi = +1 w.p. 3\n\n4 and \u22121 w.p. 1\n\n(1)\n\n1.2 Loss Functions\n\n2 (1 \u2212 gj) and (cid:96)\u2212(gj) = 1\n\nWe incur loss on test example j according to its true label yj. If yj = 1, then the loss of predicting\ngj \u2208 [\u22121, 1] on it is some function (cid:96)+(gj); and if yj = \u22121, then the loss is (cid:96)\u2212(gj). To illustrate,\nif the loss is the expected classi\ufb01cation error, then gj \u2208 [\u22121, 1] can be interpreted as a randomized\nbinary prediction in the same way as zj, so that (cid:96)+(gj) = 1\nWe call (cid:96)\u00b1 the partial losses here, following earlier work (e.g. [16]). 
Since the true label can only\nbe \u00b11, the partial losses fully specify the decision-theoretic problem we face, and changing them is\ntantamount to altering the prediction task.\nWhat could such partial losses conceivably look like in general? Observe that they intuitively mea-\nsure discrepancy to the true label \u00b11. Consequently, it is natural for e.g. (cid:96)+(g) to be decreasing,\nas g increases toward the notional true label +1. This suggests that both partial losses (cid:96)+(\u00b7) and\n(cid:96)\u2212(\u00b7) would be monotonic, which we assume hereafter in this paper (throughout we use increasing\nto mean \u201cmonotonically nondecreasing\u201d and vice versa).\nAssumption 1. Over the interval (\u22121, 1), (cid:96)+(\u00b7) is decreasing and (cid:96)\u2212(\u00b7) is increasing, and both are\ntwice differentiable.\n\n2 (1 + gj).\n\nWe view Assumption 1 as very mild, as motivated above. Notably, convexity or symmetry of the\npartial losses are not required. In this paper, \u201cgeneral losses\u201d refer to loss functions whose partial\nlosses satisfy Assumption 1, to contrast them with convex losses or other subclasses.\nThe expected loss incurred w.r.t. the randomized true labels zj is a linear combination of the partial\nlosses:\n\n(cid:18) 1 + zj\n\n(cid:19)\n\n2\n\n(cid:18) 1 \u2212 zj\n\n(cid:19)\n\n2\n\n(cid:96)(zj, gj) :=\n\n(cid:96)+(gj) +\n\n(cid:96)\u2212(gj)\n\n(2)\n\nDecision theory and learning theory have thoroughly investigated the nature of the loss (cid:96) and its\npartial losses, particularly how to estimate the \u201cconditional label probability\u201d zj using (cid:96)(zj, gj). 
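To make Equation (2) concrete, here is a small numerical sketch (our code and our parametrization, not the paper's): for the log loss rescaled to [-1, 1], with partial losses ell_plus(g) = -ln((1+g)/2) and ell_minus(g) = -ln((1-g)/2), minimizing the expected loss over g recovers the randomized label z itself.

```python
import numpy as np

# Log-loss partial losses on [-1, 1] (our rescaling of -ln p; illustrative,
# not quoted from the paper). ell_plus is decreasing and ell_minus is
# increasing, as Assumption 1 requires.
def ell_plus(g):
    return -np.log((1.0 + g) / 2.0)

def ell_minus(g):
    return -np.log((1.0 - g) / 2.0)

def expected_loss(z, g):
    """Eq. (2): loss of prediction g under a randomized label z in [-1, 1]."""
    return 0.5 * (1 + z) * ell_plus(g) + 0.5 * (1 - z) * ell_minus(g)

# Minimizing over g recovers z, so this loss estimates the conditional
# label probability.
g_grid = np.linspace(-0.999, 0.999, 2001)
for z in (-0.5, 0.0, 0.7):
    g_best = g_grid[np.argmin(expected_loss(z, g_grid))]
    assert abs(g_best - z) < 1e-2
```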
A\nnatural operation to do this is to minimize the loss over gj, and a loss (cid:96) such that arg min\n(cid:96)(zj, g) =\ng\u2208[\u22121,1]\nzj (for all zj \u2208 [\u22121, 1]) is called a proper loss ([17, 16]).\n\n1.3 Minimax Formulation\n\nAs in [1], we formulate the ensemble aggregation problem as a two-player zero-sum game between\na predictor and an adversary. In this game, the \ufb01rst player is the predictor, playing predictions over\nthe test set g \u2208 [\u22121, 1]n. The adversary then sets the true labels z \u2208 [\u22121, 1]n.\nThe key idea is that any ensemble constituent i \u2208 [p] known to have low loss on the test data gives\nus information about the unknown z, as z is constrained to be \u201cclose\u201d to the test predictions hi.\nEach hypothesis in the ensemble represents such a constraint, and z is in the intersection of all these\nconstraint sets, which interact in ways that depend on the ensemble predictions F.\nAccordingly, for now assume the predictor knows a vector of label correlations b such that\n\n\u2200i \u2208 [p] :\n\n1\nn\n\nhi(xj)zj \u2265 bi\n\n(3)\n\nn(cid:88)\n\nj=1\n\n2\n\n\fn Fz \u2265 b. When the ensemble is composed of binary classi\ufb01ers which predict in [\u22121, 1], these\ni.e. 1\np inequalities represent upper bounds on individual classi\ufb01er error rates. These can be estimated\nfrom the training set w.h.p. when the training and test data are i.i.d. using uniform convergence,\nexactly as in the prototypical supervised ERM procedure discussed in the introduction ([5]). 
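A minimal sketch of how the bound vector b of Equation (3) might be estimated from a labeled sample (our code; the Hoeffding-style slack term is an illustrative choice, as the paper only requires that the bounds hold w.h.p.):

```python
import numpy as np

# Estimate the label-correlation lower bounds b_i of Eq. (3) from labeled
# data. The slack sqrt(2 log(2p/delta) / m) is a Hoeffding-style,
# union-bounded deviation term for [-1, 1]-valued predictions; this exact
# form is our illustrative choice.
def estimate_b(F_train, y_train, delta=0.05):
    p, m = F_train.shape
    corr = F_train @ y_train / m        # empirical correlation (1/m) sum_j h_i(x_j) y_j
    slack = np.sqrt(2.0 * np.log(2.0 * p / delta) / m)
    return corr - slack                 # holds w.h.p. simultaneously for all i

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=500)
F_train = np.vstack([np.where(rng.random(500) < 0.8, y, -y),   # ~80%-accurate classifier
                     np.where(rng.random(500) < 0.6, y, -y)])  # ~60%-accurate classifier
b = estimate_b(F_train, y)
assert b[0] > b[1]    # the more accurate classifier earns the larger lower bound
```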
So in\nour game-theoretic formulation, the adversary plays under ensemble constraints de\ufb01ned by b.\nThe predictor\u2019s goal is to minimize the worst-case expected loss of g on the test data (w.r.t.\nrandomized labeling z), using the loss function as de\ufb01ned earlier in Equation (2):\n\nthe\n\nn(cid:88)\n\nj=1\n\n(cid:96)(z, g) :=\n\n1\nn\n\n(cid:96)(zj, gj)\n\nThis goal can be written as the following optimization problem, a two-player zero-sum game:\n\nV := min\n\ng\u2208[\u22121,1]n\n\nmax\nz\u2208[\u22121,1]n,\nn Fz\u2265b\n\n1\n\n(cid:96)(z, g)\n\n= min\n\ng\u2208[\u22121,1]n\n\n1\nn\n\nmax\nz\u2208[\u22121,1]n,\nn Fz\u2265b\n\n1\n\nn(cid:88)\n\n(cid:20)(cid:18) 1 + zj\n\nj=1\n\n2\n\n(cid:19)\n\n(cid:96)+(gj) +\n\n(cid:18) 1 \u2212 zj\n\n(cid:19)\n\n2\n\n(cid:21)\n\n(cid:96)\u2212(gj)\n\n(4)\n\n(5)\n\nIn this paper, we solve the learning problem faced by the predictor, \ufb01nding an optimal strategy\ng\u2217 realizing the minimum in (4) for any given \u201cgeneral loss\u201d (cid:96). This strategy guarantees the best\npossible worst-case performance on the unlabeled dataset, with an upper bound of V on the loss.\nIndeed, for all z0 and g0 obeying the constraints, Equation (4) implies the tight inequalities\n\nmin\n\ng\u2208[\u22121,1]n\n\n(cid:96)(z0, g)\n\n(a)\u2264 V \u2264 max\nz\u2208[\u22121,1]n,\nn Fz\u2265b\n\n1\n\n(cid:96)(z, g0)\n\n(6)\n\nand g\u2217 attains the equality in (a), with a worst-case loss as good as any aggregated predictor.\nIn our formulation of the problem, the constraints on the adversary take a central role. As discussed\nin previous work with this formulation ([1, 2]), these constraints encode the information we have\nabout the true labels. Without them, the adversary would \ufb01nd it optimal to trivially guarantee error\n(arbitrarily close to) 1\n2 by simply setting all labels uniformly at random (z = 0n). It is clear that\nadding information through more constraints will never raise the error bound V . 
1\nNothing has yet been assumed about (cid:96)(z, g) other than Assumption 1. Our main results will require\nonly this, holding for general losses. This brings us to this paper\u2019s contributions:\n\n1. We give the exact minimax g\u2217 \u2208 [\u22121, 1]n for general losses (Section 2.1). The optimal\nprediction on each example j is a sigmoid function of a \ufb01xed linear combination of the\nensemble\u2019s p predictions on it, so g\u2217 is a non-convex function of the ensemble predictions.\nBy (6), this incurs the lowest worst-case loss of any predictor constructed with the ensemble\ninformation F and b.\n\n2. We derive an ef\ufb01cient algorithm for learning g\u2217, by solving a p-dimensional convex opti-\nmization problem. This applies to a broad class of losses (cf. Lem. 2), including any with\nconvex partial losses. Sec. 2 develops and discusses the results.\n\n3. We extend the optimal g\u2217 and ef\ufb01cient learning algorithm for it, as above, to a large variety\nof more general ensembles and prediction scenarios (Sec. 3), including constraints arising\nfrom general loss bounds, and ensembles of \u201cspecialists\u201d and heterogeneous features.\n\n2 Results for Binary Classi\ufb01cation\nBased on the loss, de\ufb01ne the function \u0393 : [\u22121, 1] (cid:55)\u2192 R as \u0393(g) := (cid:96)\u2212(g) \u2212 (cid:96)+(g). (We also write\nthe vector \u0393(g) componentwise with [\u0393(g)]j = \u0393(gj) for convenience, so that \u0393(hi) \u2208 Rn and\n\u0393(xj) \u2208 Rp.) Observe that by Assumption 1, \u0393(g) is increasing on its domain; so we can discuss\nits inverse \u0393\u22121(m), which is typically sigmoid-shaped, as will be illustrated.\nWith these we will set up the solution to the game (4), which relies on a convex function.\n\n1However, it may pose dif\ufb01culties in estimating b by applying uniform convergence over a larger H ([2]).\n\n3\n\n\fFigure 1: At left are plots of potential wells. 
At right are optimal prediction functions g, as a function of score.\nBoth are shown for various losses, as listed in Section 2.3.\n\nDe\ufb01nition 1 (Potential Well). De\ufb01ne the potential well\n\n\uf8f1\uf8f2\uf8f3\u2212m + 2(cid:96)\u2212(\u22121)\n\n(cid:96)+(\u0393\u22121(m)) + (cid:96)\u2212(\u0393\u22121(m))\nm + 2(cid:96)+(1)\n\n\u03a8(m) :=\n\nif m \u2264 \u0393(\u22121)\nif m \u2208 (\u0393(\u22121), \u0393(1))\nif m \u2265 \u0393(1)\n\nLemma 2. The potential well \u03a8(m) is continuous and 1-Lipschitz. It is also convex under any of\nthe following conditions:\n\n(A) The partial losses (cid:96)\u00b1(\u00b7) are convex over (\u22121, 1).\n(B) The loss function (cid:96)(\u00b7,\u00b7) is a proper loss.\n+(x) for all x \u2208 (\u22121, 1).\n(C) (cid:96)(cid:48)\n\n+(x) \u2265 (cid:96)(cid:48)(cid:48)\n\n\u2212(x)(cid:96)(cid:48)(cid:48)\n\n\u2212(x)(cid:96)(cid:48)\n\nCondition (C) is also necessary for convexity of \u03a8, under Assumption 1.\n\nSo the potential wells for different losses are shaped similarly, as seen in Figure 1. Lemma 2 tells\nus that the potential well is easy to optimize under any of the given conditions. Note that these\nconditions encompass convex surrogate losses commonly used in ERM, including all such \u201cmargin-\nbased\u201d losses (convex univariate functions of zjgj), introduced primarily for their favorable compu-\ntational properties.\nAn easily optimized potential well bene\ufb01ts us, because the learning problem basically consists of\noptimizing it over the unlabeled data, as we will soon make explicit. The function that will actually\nbe optimized is in terms of the dual parameters, so we call it the slack function.\nDe\ufb01nition 3 (Slack Function). Let \u03c3 \u2265 0p be a weight vector over H (not necessarily a distri-\nbution). The vector of scores is F(cid:62)\u03c3 = (x(cid:62)\nn \u03c3), whose elements\u2019 magnitudes are the\nmargins. The prediction slack function is\n\n1 \u03c3, . . . 
, x(cid:62)\n\n\u03b3(\u03c3, b) := \u03b3(\u03c3) := \u2212b(cid:62)\u03c3 +\n\n1\nn\n\n\u03a8(x(cid:62)\n\nj \u03c3)\n\n(7)\n\nn(cid:88)\n\nj=1\n\nAn optimal weight vector \u03c3\u2217 is any minimizer of the slack function: \u03c3\u2217 \u2208 arg min\n\u03c3\u22650p\n\n[\u03b3(\u03c3)].\n\n2.1 Solution of the Game\n\nThese are used to describe the minimax equilibrium of the game (4), in our main result.\nTheorem 4. The minimax value of the game (4) is\n\n\uf8ee\uf8f0\u2212b(cid:62)\u03c3 +\n\nn(cid:88)\n\nj=1\n\n1\nn\n\n\uf8f9\uf8fb\n\n\u03a8(x(cid:62)\n\nj \u03c3)\n\nmin\n\ng\u2208[\u22121,1]n\n\nmax\nz\u2208[\u22121,1]n,\nn Fz\u2265b\n\n1\n\n(cid:96)(z, g) = V =\n\n\u03b3(\u03c3\u2217) =\n\n1\n2\n\n1\n2\n\nmin\n\u03c3\u22650p\n\n4\n\n\fThe minimax optimal predictions are de\ufb01ned as follows: for all j \u2208 [n],\nj \u03c3\u2217 \u2264 \u0393(\u22121)\nj \u03c3\u2217 \u2208 (\u0393(\u22121), \u0393(1))\nj \u03c3\u2217 \u2265 \u0393(1)\n\ng\u2217\nj := gj(\u03c3\u2217) =\n\nif x(cid:62)\nif x(cid:62)\nif x(cid:62)\n\n\u0393\u22121(x(cid:62)\n1\n\n\uf8f1\uf8f2\uf8f3\u22121\n\nj \u03c3\u2217)\n\n(8)\n\ng\u2217\nj is always an increasing sigmoid, as shown in Figure 1.\nWe can also redo the proof of Theorem 4 when g \u2208 [\u22121, 1]n is not left as a free variable set in the\ngame, but instead is preset to g(\u03c3) as in (8) for some (possibly suboptimal) weight vector \u03c3.\nObservation 5. For any weight vector \u03c30 \u2265 0p, the worst-case loss after playing g(\u03c30) is\n\n(cid:96)(z, g(\u03c30)) \u2264 1\n2\n\n\u03b3(\u03c30)\n\nmax\nz\u2208[\u22121,1]n,\nn Fz\u2265b\n\n1\n\nThe proof is a simpli\ufb01ed version of that of Theorem 4; there is no minimum over g to deal with,\nand the minimum over \u03c3 \u2265 0p in Equation (13) is upper-bounded by using \u03c30. 
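As a concrete instance of the decision rule (8): for the log loss in the parametrization ell_plus(g) = -ln((1+g)/2), ell_minus(g) = -ln((1-g)/2) (ours, for illustration), Gamma(g) = ln((1+g)/(1-g)), so Gamma(+-1) = +-infinity, the clipping branches never activate, and the optimal prediction is the genuine sigmoid g_j = tanh(x_j^T sigma* / 2).

```python
import numpy as np

# Eq. (8) specialized to log loss (our parametrization of the partial
# losses). Gamma(g) = log((1+g)/(1-g)) is the logit on [-1, 1]; its
# inverse is tanh(m/2), an increasing sigmoid of the score m = x_j^T sigma.
def gamma(g):
    return np.log((1.0 + g) / (1.0 - g))

def g_star(scores):
    return np.tanh(scores / 2.0)

m = np.linspace(-5.0, 5.0, 101)
assert np.allclose(gamma(g_star(m)), m)    # tanh(m/2) really inverts Gamma
assert np.all(np.abs(g_star(m)) < 1.0)     # predictions never need clipping
```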
This result is an\nexpression of weak duality in our setting, and generalizes Observation 4 of [1].\n\n2.2 Ensemble Aggregation Algorithm\n\nTheorem 4 de\ufb01nes a prescription for aggregating the given ensemble predictions on the test set.\nLearning: Minimize the slack function \u03b3(\u03c3), \ufb01nding the minimizer \u03c3\u2217 that achieves V .\nThis is a convex optimization under broad conditions (Lemma 2), and when the test examples are\ni.i.d. the \u03a8 term is a sum of n i.i.d. functions. Therefore, it is readily amenable to standard \ufb01rst-order\noptimization methods which require only O(1) test examples at once. In practice, learning employs\nsuch methods to approximately minimize \u03b3, \ufb01nding some \u03c3A such that \u03b3(\u03c3A) \u2264 \u03b3(\u03c3\u2217) + \u0001 for\nsome small \u0001. Standard convex optimization methods are guaranteed to do this for binary classi\ufb01er\nensembles, because the slack function is Lipschitz (Lemma 2) and (cid:107)b(cid:107)\u221e \u2264 1.\nPrediction: Predict g(\u03c3\u2217) on any test example, as indicated in (8).\nThis decouples the prediction task over each test example separately, which requires O(p) time and\nmemory like p-dimensional linear prediction. After \ufb01nding an \u0001-approximate minimizer \u03c3A in the\nlearning step as above, Observation 5 tells us that the prediction g(\u03c3A) has loss \u2264 V + \u0001\n2.\nIn particular, note that there is no algorithmic dependence on n in either step in a statistical learning\nsetting. So though our formulation is transductive, it is no less tractable than a stochastic optimiza-\ntion setting in which i.i.d. data arrive one at a time, and applies to this common situation.\n\n2.3 Examples of Different Losses\n\nTo further illuminate Theorem 4, we detail a few special cases in which (cid:96)\u00b1 are explicitly de\ufb01ned.\nThese losses may be found throughout the literature (see e.g. [16]). 
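Before detailing them, the learning and prediction steps of Sec. 2.2 can be sketched end to end for the 0-1 loss (our code, not the authors' implementation): there Gamma(g) = g, the potential well reduces to Psi(m) = max(|m|, 1), and the optimal prediction clips the score to [-1, 1]; we use scipy's L-BFGS-B for the constrained minimization over sigma >= 0.

```python
import numpy as np
from scipy.optimize import minimize

# Sec. 2.2 for the 0-1 loss (a sketch; our code). Here Gamma(g) = g, the
# potential well is Psi(m) = max(|m|, 1), and Eq. (8) clips the score.
def psi(m):
    return np.maximum(np.abs(m), 1.0)

def slack(sigma, F, b):
    """Eq. (7): gamma(sigma) = -b^T sigma + (1/n) sum_j Psi(x_j^T sigma)."""
    return -b @ sigma + psi(F.T @ sigma).mean()

def learn(F, b):
    """Minimize the slack function over sigma >= 0 (convex, by Lemma 2)."""
    p = F.shape[0]
    res = minimize(slack, x0=np.zeros(p), args=(F, b),
                   method="L-BFGS-B", bounds=[(0.0, None)] * p)
    return res.x

def predict(F, sigma):
    """Eq. (8) for 0-1 loss: clip each score x_j^T sigma to [-1, 1]."""
    return np.clip(F.T @ sigma, -1.0, 1.0)

# Toy run: p = 2 classifiers on n = 4 unlabeled points.
F = np.array([[1.0,  1.0, -1.0, -1.0],
              [1.0, -1.0,  1.0, -1.0]])
b = np.array([0.6, 0.1])                # h_1 carries the stronger correlation bound
sigma_star = learn(F, b)
g = predict(F, sigma_star)
assert sigma_star[0] > sigma_star[1]    # the tighter bound earns more weight
```

By Theorem 4, the guaranteed worst-case loss of g is then slack(sigma_star, F, b) / 2.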
The key functions \u03a8 and g\u2217 are\nlisted for these losses in Appendix A, and in many cases in Figure 1. The nonlinearities used for g\u2217\nare sigmoids, arising solely from the intrinsic minimax structure of the classi\ufb01cation game.\n\nin [1], the work we generalize in this paper.\n\n\u2022 0-1 Loss: Here gj is taken to be a randomized binary prediction; this case was developed\n\u2022 Log Loss, Square Loss\n\u2022 Cost-Weighted Misclassi\ufb01cation (Quantile) Loss: This is de\ufb01ned with a parameter c \u2208\n[0, 1] representing the relative cost of false positives vs. false negatives, making the Bayes-\noptimal classi\ufb01er the c-quantile of the conditional probability distribution ([19]).\n\ntypically\n\ngiven\n\nas\n. Our formulation is equivalent when the\n\n[0, 1]\n\np, y\n\nfor\n\n\u2208\n\n\u2022 Exponential Loss, Logistic Loss\n\u2022 Hellinger Loss:\nis\n1 \u2212 p \u2212 \u221a\n\n+(cid:0)\u221a\n\n(cid:16)(cid:0)\u221a\n\np \u2212 \u221a\n\ny(cid:1)2\n\nThis\n\n1 \u2212 y(cid:1)2(cid:17)\n\nprediction and label are rescaled to [\u22121, 1].\n\n1\n2\n\n5\n\n\f\u2022 \u201cAdaBoost Loss\u201d: If the goal of AdaBoost ([18]) is interpreted as class probability esti-\nmation, the implied loss is proper and given in [6, 16].\n\u2022 Absolute Loss and Hinge Loss: The absolute loss can be de\ufb01ned by (cid:96)abs\u2213 (gj) = 1 \u00b1 gj,\nand the hinge loss also has (cid:96)abs\u2213 (gj) = 1 \u00b1 gj since the kink in the hinge loss only lies at\ngj = \u22131. These partial losses are the same as for 0-1 loss up to scaling, and therefore all\nour results for \u03a8 and g\u2217 are as well. So these losses are not shown in Appendix A.\n\u2022 Sigmoid Loss: This is an example of a sigmoid-shaped margin loss, a nonconvex smooth\nsurrogate for 0-1 loss. Similar losses have arisen in a variety of binary classi\ufb01cation con-\ntexts, from robustness (e.g. 
[9]) to active learning ([10]) and structured prediction ([14]).\n\n2.4 Related Work and Technical Discussion\n\nThere are two notable ways in which the result of Theorem 4 is particularly advantageous and gen-\neral. First, the fact that (cid:96)(z, g) can be non-convex in g, yet solvable by convex optimization, is a\nmajor departure from previous work. Second, the solution has a convenient dependence on n (as\nin [1]), simply averaging a function over the unlabeled data, which is not only mathematically con-\nvenient but also makes stochastic O(1)-space optimization practical. This is surprisingly powerful,\nbecause the original minimax problem is jointly over the entire dataset, avoiding further indepen-\ndence or decoupling assumptions.\nBoth these favorable properties stem from the structure of the binary classi\ufb01cation problem, as\nwe can describe by examining the optimization problem constructed within the proof of Thm. 4\n(Appendix C.1). In it, the constraints which do not explicitly appear with Lagrange parameters are\nall box, or L\u221e norm, constraints. These decouple over the n test examples, so the problem can\nbe reduced to the one-dimensional optimization at the heart of Eq. (14), which is solved ad hoc.\nSo we are able to obtain minimax results for these non-convex problems \u2013 the gi are \u201cclipped\u201d\nsigmoid functions because of the bounding effect of the [\u22121, 1] box constraints intrinsic to binary\nclassi\ufb01cation. We introduce Lagrange parameters \u03c3 only for the p remaining constraints in the\nproblem, which do not decouple as above, applying globally over the n test examples. 
However,\nthese constraints only depend on n as an average over examples (which is how they arise in dual\nform in Equation (16) of the proof), and the loss function itself is also an average (Equation (12)).\nThis makes the box constraint decoupling possible, and leads to the favorable dependence on n,\nmaking an ef\ufb01cient solution possible to a potentially \ufb02agrantly non-convex problem.\nTo summarize, the technique of optimizing only \u201chalfway into\u201d the dual allows us to readily manip-\nulate the minimax problem exactly without using an approximation like weak duality, despite the\nlack of convexity in g. This technique was used implicitly for a different purpose in the \u201cdrifting\ngame\u201d analysis of boosting ([18], Sec. 13.4.1). Existing boosting work is loosely related to our\napproach in being a transductive game invoked to analyze ensemble aggregation, but it does not\nconsider unlabeled data and draws its power instead from being a repeated game ([18]).\nThe predecessor to this work ([1]) addresses a problem, 0-1 loss minimization, that is known to be\nNP-hard when solved directly ([11]). Using the unlabeled data is essential to surmounting this. It\ngives the dual problem an independently interesting interpretation, so the learning problem is on the\nalways-convex Lagrange dual function and is therefore tractable.\nThis paper\u2019s transductive formulation involves no surrogates or relaxations of the loss, in sharp con-\ntrast to most previous work. This allows us to bypass the consistency and agnostic-learning discus-\nsions ([22, 3]) common to ERM methods that use convex risk minimization. Convergence analyses\nof those methods make heavy use of convexity of the losses and are generally done presupposing a\nlinear weighting over H ([21]), whereas here such structure emerges directly from Lagrange duality\nand involves no convexity to derive the worst-case-optimal predictions.\nThe conditions in Assumption 1 are notably general. 
Differentiability of the partial losses is con-\nvenient, but not necessary, and only used because \ufb01rst-order conditions are a convenient way to\nestablish convexity of the potential well in Lemma 2. It is never used elsewhere, including in the\nminimax arguments used to prove Theorem 4. These manipulations are structured to be valid even if\n(cid:96)\u00b1 are non-monotonic; but in this case, g\u2217\nj could turn out to be multi-valued and hence not a genuine\nfunction of the example\u2019s score.\nWe emphasize that our result on the minimax equilibrium (Theorem 4) holds for general losses\n\u2013 the slack function may not be convex unless the further conditions of Lemma 2 are met, but\n\n6\n\n\fj \u03c3) is always increasing in x(cid:62)\n\nthe interpretation of the optimum in terms of margins and sigmoid functions remains. All this\nemerges from the inherent decision-theoretic structure of the problem (the proof of Appendix C.1). It\nmanifests in the fact that the function g(x(cid:62)\nj \u03c3 for general losses, because\nthe function \u0393 is increasing. This monotonicity typically needs to be assumed in a generalized linear\nmodel (GLM; [15]) and related settings. \u0393 appears loosely analogous to the link function used by\nGLMs, with its inverse being used for prediction.\nThe optimal decision rules emerging from our framework are arti\ufb01cial neurons of the ensemble in-\nputs. Helmbold et al. introduce the notion of a \u201cmatching loss\u201d ([13]) for learning the parameters of\na (fully supervised) arti\ufb01cial neuron with an arbitrary increasing transfer function, effectively taking\nthe opposite tack of this paper in using a neuron\u2019s transfer function to derive a loss to minimize in or-\nder to learn the neuron\u2019s weights by convex optimization. 
Our assumptions on the loss, particularly\ncondition (C) of Lemma 2, have arisen independently in earlier online learning work by some of the\nsame authors ([12]); this may suggest connections between our techniques. We also note that our\nfamily of general losses was de\ufb01ned independently by [19] in the ERM setting (dubbed \u201cF-losses\u201d)\n\u2013 in which condition (C) of Lemma 2 also has signi\ufb01cance ([19], Prop. 2) \u2013 but has seemingly\nnot been revisited thereafter. Further \ufb02eshing out the above connections would be interesting future\nwork.\n\n3 Extensions\n\nWe detail a number of generalizations to the basic prediction scenario of Sec. 2. These extensions\nare not mutually exclusive, and apply in conjunction with each other, but we list them separately for\nclarity. They illustrate the versatility of our minimax framework, particularly Sec. 3.4.\n\n3.1 Weighted Test Sets and Covariate Shift\n\nThough our results here deal with binary classi\ufb01cation of an unweighted test set, the formulation\ndeals with a weighted set with only a slight modi\ufb01cation of the slack function:\nTheorem 6. For any vector r \u2265 0n,\n\n\uf8ee\uf8f0\u2212b(cid:62)\u03c3 +\n\nn(cid:88)\n\n1\nn\n\n1\n2\n\nmin\n\u03c3\u22650p\n\n(cid:32)\n\nrj\u03a8\n\nx(cid:62)\nj \u03c3\nrj\n\n(cid:33)\uf8f9\uf8fb\n\nn(cid:88)\n\nj=1\n\n1\nn\n\nmin\n\nrj(cid:96)(zj, gj) =\n\nmax\nz\u2208[\u22121,1]n,\nn Fz\u2265b\n\n1\n\ng\u2208[\u22121,1]n\nr as the minimizer of the RHS above, the optimal predictions g\u2217 = g(\u03c3\u2217\n\nWriting \u03c3\u2217\nr ), as in Theorem 4.\nSuch weighted classi\ufb01cation can be parlayed into algorithms for general supervised learning prob-\nlems via learning reductions ([4]). 
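A sketch of the weighted slack function from Theorem 6 (our code, shown for 0-1 loss and assuming every weight r_j > 0 so that Psi(x_j^T sigma / r_j) is well defined); with uniform weights it collapses to the unweighted slack of Eq. (7):

```python
import numpy as np

# Theorem 6's weighted slack, specialized to 0-1 loss where
# Psi(m) = max(|m|, 1). We assume r_j > 0 for every test example.
def weighted_slack(sigma, F, b, r):
    scores = F.T @ sigma
    return -b @ sigma + np.mean(r * np.maximum(np.abs(scores / r), 1.0))

F = np.array([[1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0]])
b = np.array([0.4, 0.2])
sigma = np.array([0.5, 0.3])
# With r = 1 this is exactly the unweighted slack of Eq. (7).
unweighted = -b @ sigma + np.mean(np.maximum(np.abs(F.T @ sigma), 1.0))
assert np.isclose(weighted_slack(sigma, F, b, np.ones(3)), unweighted)
```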
Allowing weights on the test set for the evaluation is tantamount\nto accounting for known covariate shift in our setting; it would be interesting, though outside our\nscope, to investigate scenarios with more uncertainty.\n\nj=1\n\n3.2 General Loss Constraints on the Ensemble\n\nSo far in the paper, we have considered the constraints on ensemble classi\ufb01ers as derived from\ntheir label correlations (i.e. 0-1 losses), as in (3). However, this view can be extended signi\ufb01cantly\nwith the same analysis, because any general loss (cid:96)(z, g) is linear in z (Eq. (2)), which allows our\ndevelopment to go through essentially intact.\nIn summary, a classi\ufb01er can be incorporated into our framework for aggregation if we have a gener-\nalization loss bound on it, for any \u201cgeneral loss.\u201d This permits an enormous variety of constraint sets,\nas each classi\ufb01er considered can have constraints corresponding to any number of loss bounds on it,\neven multiple loss bounds using different losses. For instance, h1 can yield a constraint correspond-\ning to a zero-one loss bound, h2 can yield one constraint corresponding to a square loss bound and\nanother corresponding to a zero-one loss bound, and so on. Appendix B details this idea, extending\nTheorem 4 to general loss constraints.\n\n3.3 Uniform Convergence Bounds for b\n\nIn our basic setup, b has been taken as a lower bound on ensemble classi\ufb01er label correlations.\nBut as mentioned earlier, the error in estimating b is in fact often quanti\ufb01ed by two-sided uniform\n\n7\n\n\fconvergence (L\u221e) bounds on b. Constraining z in this fashion amounts to L1 regularization of the\ndual vector \u03c3.\nProposition 7. 
For any \u0001 \u2265 0,\n\n\uf8ee\uf8f0\u2212b(cid:62)\u03c3 +\n\nn(cid:88)\n\nj=1\n\n1\nn\n\n\uf8f9\uf8fb\n\n\u03a8(x(cid:62)\n\nj \u03c3) + \u0001(cid:107)\u03c3(cid:107)1\n\nmin\n\ng\u2208[\u22121,1]n\n\nmax\n\nz\u2208[\u22121,1]n,\n(cid:107) 1\nn Fz\u2212b(cid:107)\u221e\u2264\u0001\n\n(cid:96)(z, g) = min\n\u03c3\u2208Rp\n\nAs in Thm. 4, the optimal g\u2217 = g(\u03c3\u2217\n\n\u221e), where \u03c3\u2217\n\n\u221e is the minimizer of the right-hand side above.\n\nHere we optimize over all vectors \u03c3 (not just nonnegative ones) in an L1-regularized problem, con-\nvenient in practice because we can cross-validate over the parameter \u0001. To our knowledge, this\nparticular analysis has been addressed in prior work only for the special case of the entropy loss on\nthe probability simplex, discussed further in [8].\nProp. 7 is a corollary of a more general result using differently scaled label correlation deviations\nto regularizing the minimization over \u03c3 by its c-weighted L1 norm c(cid:62) |\u03c3| (Thm. 11), used to\npenalize the ensemble nonuniformly ([7]). This situation is motivated by uniform convergence of\nheterogeneous ensembles comprised of e.g. \u201cspecialist\u201d predictors, for which a union bound ([5])\n\nn Fz \u2212 b(cid:12)(cid:12) \u2264 c for a general c \u2265 0n. This turns out to be equivalent\n\nn Fz \u2212 b(cid:12)(cid:12) with varying coordinates. Such ensembles are discussed next.\n\nwithin the ensemble, i.e. (cid:12)(cid:12) 1\nresults in(cid:12)(cid:12) 1\n\n3.4 Heterogenous Ensembles of Specialist Classi\ufb01ers and Features\n\nAll the results and algorithms in this paper apply in full generality to ensembles of \u201cspecialist\u201d\nclassi\ufb01ers that only predict on some subset of the test examples. 
This is done by merely calculating\nthe constraints over only these examples, and changing F and b accordingly ([2]).\nTo summarize this from [2], suppose a classi\ufb01er i \u2208 [p] decides to abstain on an example xj\n(j \u2208 [n]) with probability 1 \u2212 vi(x), and otherwise predict hi(x). Our only assumption on\nj=1 vi(xj) > 0, so classi\ufb01er i is not a useless\n\n{vi(x1), . . . , vi(xn)} is the reasonable one that(cid:80)n\n\nspecialist that abstains everywhere.\nThe information about z contributed by classi\ufb01er i is now not its overall correlation with z on the\nentire test set, but rather the correlation with z restricted to the test examples on which it predicts.\nn Sz, where the matrix S is formed by reweighting each row of F:\nOn the test set, this is written as 1\n\n\uf8eb\uf8ec\uf8ec\uf8ed\u03c11(x1)h1(x1) \u03c11(x2)h1(x2)\n\n\u03c12(x1)h2(x1) \u03c12(x2)h2(x2)\n\n...\n\n...\n\n\u03c1p(x1)hp(x1) \u03c1p(x2)hp(x2)\n\n\uf8f6\uf8f7\uf8f7\uf8f8 ,\n\n\u00b7\u00b7\u00b7 \u03c11(xn)h1(xn)\n\u00b7\u00b7\u00b7 \u03c12(xn)h2(xn)\n...\n\u00b7\u00b7\u00b7 \u03c1p(xn)hp(xn)\n\n...\n\nS := n\n\n(cid:80)n\n\nvi(xj)\nk=1 vi(xk)\n\n\u03c1i(xj) :=\n\n(S = F when the entire ensemble consists of non-specialists, recovering our initial setup.) There-\nn Sz \u2265 bS, where bS gives the label correlations of each\nfore, the ensemble constraints (3) become 1\nclassi\ufb01er restricted to the examples on which it predicts. Though this rescaling results in entries of\nS having different ranges and magnitudes \u2265 1, our results and proofs remain entirely intact.\nIndeed, despite the title, this paper applies far more generally than to an ensemble of binary classi-\n\ufb01ers, because our proofs make no assumptions at all about the structure of F. Each predictor in the\nensemble can be thought of as a feature; it has so far been convenient to think of it as binary, fol-\nlowing the perspective of binary classi\ufb01er aggregation, but it could as well be e.g. 
real-valued, and\nthe features can have very different scales (as in S above). An unlabeled example x is simply a vec-\ntor of features, so arbitrarily abstaining specialists are equivalent to \u201cmissing features,\u201d which this\nframework handles seamlessly due to the given unlabeled data. Our development applies generally\nto semi-supervised binary classi\ufb01cation.\n\nAcknowledgements\n\nAB is grateful to Chris \u201cCeej\u201d Tosh for feedback that made the manuscript clearer. This work was\nsupported by the NSF (grant IIS-1162581).\n\n8\n\n\f", "award": [], "sourceid": 2583, "authors": [{"given_name": "Akshay", "family_name": "Balsubramani", "institution": "UC San Diego"}, {"given_name": "Yoav", "family_name": "Freund", "institution": "University of California, San Diego"}]}