{"title": "Quantifying Learning Guarantees for Convex but Inconsistent Surrogates", "book": "Advances in Neural Information Processing Systems", "page_first": 669, "page_last": 677, "abstract": "We study consistency properties of machine learning methods based on minimizing convex surrogates. We extend the recent framework of Osokin et al. (2017) for the quantitative analysis of consistency properties to the case of inconsistent surrogates. Our key technical contribution is a new lower bound on the calibration function for the quadratic surrogate, which is non-trivial (not always zero) in inconsistent cases. The new bound allows us to quantify the level of inconsistency of a setting and shows how learning with inconsistent surrogates can still come with guarantees on sample complexity and optimization difficulty. We apply our theory to two concrete cases: multi-class classification with the tree-structured loss and ranking with the mean average precision loss. The results show the approximation-computation trade-offs caused by inconsistent surrogates and their potential benefits.", "full_text": "Quantifying Learning Guarantees for Convex but Inconsistent Surrogates

Kirill Struminsky
NRU HSE,∗ Moscow, Russia

Simon Lacoste-Julien†
MILA and DIRO, Université de Montréal, Canada

Anton Osokin
NRU HSE,∗‡ Moscow, Russia
Skoltech,§ Moscow, Russia

Abstract

We study consistency properties of machine learning methods based on minimizing convex surrogates. We extend the recent framework of Osokin et al. [14] for the quantitative analysis of consistency properties to the case of inconsistent surrogates. Our key technical contribution is a new lower bound on the calibration function for the quadratic surrogate, which is non-trivial (not always zero) in inconsistent cases.
The new bound allows us to quantify the level of inconsistency of a setting and shows how learning with inconsistent surrogates can still come with guarantees on sample complexity and optimization difficulty. We apply our theory to two concrete cases: multi-class classification with the tree-structured loss and ranking with the mean average precision loss. The results show the approximation-computation trade-offs caused by inconsistent surrogates and their potential benefits.

1 Introduction

Consistency is a desirable property of any statistical estimator; informally, it means that in the limit of infinite data the estimator converges to the correct quantity. In the context of machine learning algorithms based on surrogate loss minimization, we usually use the notion of Fisher consistency, which means that exact minimization of the expected surrogate loss leads to exact minimization of the actual task loss. It can be shown that Fisher consistency is closely related to the question of infinite-sample consistency (a.k.a. classification calibration) of the surrogate loss with respect to the task loss (see [2, 17] for a detailed review).

The property of infinite-sample consistency (which we will refer to simply as consistency) shows that the minimization of a particular surrogate is the right problem to solve, but it becomes especially attractive when one can actually minimize the surrogate, which is the case, e.g., when the surrogate is convex. Consistency of convex surrogates has been the central question of many studies for such problems as binary classification [2, 24, 19], multi-class classification [23, 21, 1, 17], ranking [11, 4, 5, 18, 15] and, more recently, structured prediction [7, 14].

Recently, Osokin et al. [14] have pinpointed that in some cases minimizing a consistent convex surrogate might not be sufficient for efficient learning.
In particular, when the number of possible predictions is large (which is typically the case in the settings of structured prediction and ranking), reaching an adequately small value of the expected task loss can be practically impossible, because one would need to optimize the surrogate to high accuracy, which requires an intractable number of iterations of the optimization algorithm.

∗National Research University Higher School of Economics
†CIFAR Fellow
‡Samsung-HSE Joint Lab
§Skolkovo Institute of Science and Technology

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

It also turns out [14] that the possibility of efficient learning is related to the structure of the task loss. The 0-1 loss, which does not distinguish between different kinds of errors, shows the worst-case behavior. However, more structured losses, e.g., the Hamming distance between sequence labelings, allow efficient learning if the score vector is designed appropriately (for the Hamming distance, the score for a complete configuration should be decomposable into the sum of scores for individual elements).

However, the analysis of Osokin et al. [14] gives non-trivial conclusions only for consistent surrogates. At the same time, it is known that inconsistent surrogates often work well in practice (for example, the Crammer-Singer formulation of multi-class SVM [8], or its generalization, structured SVM [20, 22]). There have indeed been several works analyzing inconsistent surrogates [12, 18, 5, 14], but they usually end the story by proving that some surrogate (or a family of surrogates) is not consistent.

Contributions. In this work, we look at the problem from a more quantitative angle and analyze to what extent inconsistent surrogates can be useful for learning.
We focus on the same setting as [14] and generalize their results to the case of inconsistent surrogates (their bounds are trivial in these cases) in order to draw non-trivial conclusions. The main technical contribution is a tighter lower bound on the calibration function (Theorem 3), which is strictly more general than the bound of [14]. Notably, our bound is non-trivial when the surrogate is not consistent and quantifies to what degree learning with inconsistent surrogates is possible. We further study the behavior of our bound in two practical scenarios: multi-class classification with a tree-structured loss and ranking with the mean average precision (mAP) loss. For the tree-structured loss, our bound shows that there can be a trade-off between the best achievable accuracy and the speed of convergence. For the mAP loss, we use our tools to study the (non-)existence of consistent convex surrogates of a particular dimension (an important issue for the task of ranking [11, 4, 5, 18, 17]) and quantify to what extent our quadratic surrogate with a score vector of insufficient dimension is consistent.

This paper is organized as follows. First, we introduce the setting we work with in Section 2 and review the key results of [14] in Section 3. In Section 4, we prove our main theoretical result, which is a new lower bound on the calibration function. In Section 5, we analyze the behavior of our bound in two different settings: multi-class classification and ranking (the mean average precision loss). Finally, we review related works and conclude in Section 6.

2 Notation and Preliminaries

In this section, we introduce our setting, which closely follows [14]. We denote the input features by x ∈ X, where X is the input domain. The particular structure of X is not of key importance for this study.
The output variables, which are at the center of our analysis, will be denoted by ŷ ∈ Ŷ, with Ŷ being the set of possible predictions, or the output domain.5 In such settings as structured prediction or ranking, the predictions are very high-dimensional and have structure that is useful to model explicitly (for example, a sequence, permutation or image).

The central object of our study is the loss function L(ŷ, y) ≥ 0, which represents the cost of making the prediction ŷ ∈ Ŷ when the ground-truth label is y ∈ Y. Note that in some applications of interest the sets Ŷ and Y are different. For example, in ranking with the mean average precision (mAP) loss function (see Section 5.2 and, e.g., [18] for the details), the set Ŷ consists of all the permutations of the items (to represent the ranking itself), but the set Y consists of all the subsets of items (to represent the set of relevant items, which is the ground-truth annotation in this setting). In this paper, we only study the case when both Ŷ and Y are finite. We denote the cardinality of Ŷ by k and the cardinality of Y by m. In this case, the loss function can be encoded as a matrix L of size k × m. In many applications of interest, both quantities k and m are exponentially large in the natural dimension of the input x. For example, in the task of sequence labeling, both k and m are equal to the number of all possible sequences of symbols from a finite alphabet.
In the task of ranking (the mAP formulation), k is equal to the number of permutations of items and m is equal to the number of item subsets.

5 The output domain Ŷ itself can depend on the vector of input features x (for example, if x can represent sequences of different lengths and the length of the output sequence has to equal the length of the input), but we will not use this dependency and omit it for brevity.

Following usual practice, we work with the prediction model defined by a (learned) vector-valued score function f : X → R^k, which defines a scalar score f_ŷ(x) for each possible output ŷ ∈ Ŷ. The final prediction is then chosen as an output configuration with the maximal score:

pred(f(x)) := argmax_{ŷ ∈ Ŷ} f_ŷ(x).    (1)

If the maximal score is given by multiple outputs ŷ (so-called ties), the predictor follows a simple deterministic tie-breaking rule and picks the output appearing first in some predefined ordering on Ŷ. In this setup, learning consists in finding a score function f for which the predictor gives the smallest expected loss with features x and labels y coming from an unknown data-generating distribution D:

R_L(f) := E_{(x,y)∼D} L(pred(f(x)), y).    (2)

The quantity R_L(f) is usually referred to as the actual (or population) risk based on the loss L. Minimizing the actual risk directly is usually difficult (because of the non-convexity and non-continuity of the predictor (1)). The standard approach is to substitute (2) with another objective, a surrogate risk (or Φ-risk), which is easier to optimize (in this paper, we only consider convex surrogates):

R_Φ(f) := E_{(x,y)∼D} Φ(f(x), y),    (3)

where we will refer to the function Φ : R^k × Y → R as the surrogate loss. To make the minimization of (3) well-defined, we will always assume the surrogate loss Φ to be bounded from below and continuous.

The surrogate loss should be chosen in such a way that the minimization of (3) also leads to the minimization of (2), i.e., to the solution of the original problem. The property of consistency of the surrogate loss formalizes this intuition, i.e., it guarantees that, no matter the data-generating distribution, minimizing (3) w.r.t. f implies minimizing (2) w.r.t. f as well (both of these are possible only in the limit of infinite data and computational budget). Osokin et al. [14] quantified what happens if the surrogate risk is minimized approximately, by translating the optimization error of (3) into the optimization error of (2). The main goal of this paper is to generalize this analysis to the case when the surrogate is not consistent and to show that there can be trade-offs between the minimum value of the actual risk that can be achieved by minimizing an inconsistent surrogate and the speed with which this minimum can be achieved.

3 Calibration Functions and Consistency

In this section, we review the approach of Osokin et al. [14] for studying consistency in the context of structured prediction. The first part of the analysis establishes the connection between the minimization of the actual risk R_L (2) and the surrogate risk R_Φ (3) via the so-called calibration function (see Definition 1 [14, and references therein]). This step is usually called non-parametric (or pointwise) because it does not explicitly model the dependency of the scores f := f(x) on the input variables x.
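To fix ideas, the prediction rule (1) and the population risk (2) can be transcribed for a small finite problem; this is an illustrative sketch (all function and variable names here are our own, not from the paper), where `np.argmax` plays the role of the deterministic tie-breaking predictor:

```python
import numpy as np

def pred(f):
    """Prediction rule (1): argmax over scores; np.argmax returns the first
    maximal index, which is a deterministic tie-breaking rule."""
    return int(np.argmax(f))

def actual_risk(L, scores, p_x, p_y_given_x):
    """Population risk (2) for a finite toy distribution.

    L           : (k, m) task-loss matrix
    scores      : (n, k) score vector f(x) for each of n inputs
    p_x         : (n,)   marginal distribution of x
    p_y_given_x : (n, m) conditional distribution of y given x
    """
    risk = 0.0
    for i, px in enumerate(p_x):
        # expected loss of the deterministic prediction at this x
        risk += px * float(p_y_given_x[i] @ L[pred(scores[i])])
    return risk
```

For instance, with the 0-1 loss a score function that puts the maximal score on the most likely label attains zero risk on a noiseless distribution, while the label-swapping score function attains risk one.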
The second part of the analysis establishes the connection with an optimization algorithm, allowing one to state how many iterations are enough to find a predictor that is (in expectation) within ε of the global minimum of the actual risk R_L.

Non-parametric analysis. The standard non-parametric setting considers all measurable score functions f to effectively ignore the dependency on the features x. As noted by [14], it is beneficial to consider a restricted set of score functions F_F that consists of all vector-valued Borel measurable functions f : X → F, where F ⊆ R^k is a subspace of allowed score vectors. Compatibility of the subspace F and the loss function L will be a crucial point of this paper. Note that the analysis is still non-parametric because the dependence on x is not explicitly modeled.

Within the analysis, we will use the conditional actual and surrogate risks defined as the expectations of the corresponding losses w.r.t. a categorical distribution q on the set of annotations Y, m := |Y|:

ℓ(f, q) := Σ_{y=1}^{m} q_y L(pred(f), y),    φ(f, q) := Σ_{y=1}^{m} q_y Φ(f, y).    (4)

Hereinafter, we represent an m-dimensional categorical distribution q as a point in the probability simplex Δ_m and use the symbol q_y to denote the probability of the y-th outcome.
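The conditional risks in (4) are direct finite sums; a minimal sketch (illustrative names; `Phi` can be any surrogate loss taking a score vector and a label index):

```python
import numpy as np

def cond_actual_risk(L, f, q):
    """l(f, q) = sum_y q_y L(pred(f), y) from eq. (4); L is the k x m loss matrix."""
    return float(q @ L[int(np.argmax(f))])

def cond_surrogate_risk(Phi, f, q):
    """phi(f, q) = sum_y q_y Phi(f, y) from eq. (4)."""
    return sum(q_y * Phi(f, y) for y, q_y in enumerate(q))
```

For example, with the quadratic surrogate of Section 3.1 one would pass `Phi = lambda f, y: np.sum((f + L[:, y]) ** 2) / (2 * len(f))`.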
Using this notation, we can rewrite the risk R_L and the surrogate risk R_Φ as

R_L(f) = E_{x∼D_X} ℓ(f(x), P_D(· | x)),    R_Φ(f) = E_{x∼D_X} φ(f(x), P_D(· | x)),    (5)

where D_X is the marginal distribution of x and P_D(· | x) denotes the conditional distribution of y given x (both defined for the joint data-generating distribution D).

For each score vector f ∈ F and a distribution q ∈ Δ_m over ground-truth labels, we now define the excess actual and surrogate risks

δφ(f, q) = φ(f, q) − inf_{f̂ ∈ F} φ(f̂, q),    δℓ(f, q) = ℓ(f, q) − inf_{f̂ ∈ R^k} ℓ(f̂, q),    (6)

which show how close the current conditional actual and surrogate risks are to the corresponding minimal achievable conditional risks (depending only on the distribution q). Note that the two infima in (6) are defined w.r.t. different sets of score vectors. For the surrogate risk, the infimum is taken w.r.t. the set of allowed scores F, capturing only the scores obtainable by the learning process. For the actual risk, the infimum is taken w.r.t. the set of all possible scores R^k, including score vectors that cannot be learned. This distinction is important when analyzing inconsistent surrogates and allows us to characterize the approximation error of the selected function class.6

We are now ready to define the calibration function, which is the final object of the non-parametric part of the analysis. Calibration functions directly show how well one needs to minimize the surrogate risk to guarantee that the excess of the actual risk is smaller than ε.

Definition 1 (Calibration function, [14]).
For a task loss L, a surrogate loss Φ, and a set of feasible scores F, the calibration function H_{Φ,L,F}(ε) is defined as:

H_{Φ,L,F}(ε) := inf_{f ∈ F, q ∈ Δ_m} δφ(f, q)    s.t.  δℓ(f, q) ≥ ε,    (7)

where ε ≥ 0 is the target accuracy. We set H_{Φ,L,F}(ε) to +∞ when the feasible set is empty.

By construction, H_{Φ,L,F} is non-decreasing on [0, +∞), H_{Φ,L,F}(ε) ≥ 0 and H_{Φ,L,F}(0) = 0. The calibration function also provides the so-called excess risk bound

H_{Φ,L,F}(δℓ(f, q)) ≤ δφ(f, q),  ∀f ∈ F, ∀q ∈ Δ_m,    (8)

which implies the formal connection between the surrogate and task risks [14, Theorem 2].

The calibration function can fully characterize the consistency of the setting defined by the surrogate loss, the subspace of scores and the task loss. The maximal value of ε at which the calibration function H_{Φ,L,F}(ε) equals zero shows the best accuracy on the actual loss that can be obtained [14, Theorem 6]. The notion of level-η consistency captures this effect.

Definition 2 (level-η consistency, [14]). A surrogate loss Φ is consistent up to level η ≥ 0 w.r.t. a task loss L and a set of scores F if and only if the calibration function satisfies H_{Φ,L,F}(ε) > 0 for all ε > η and there exists ε̂ > η such that H_{Φ,L,F}(ε̂) is finite.

The case of level-0 consistency corresponds to the classical consistent surrogate and Fisher consistency. When η > 0, the surrogate is not consistent, meaning that the actual risk cannot be minimized globally. However, Osokin et al.
[14, Appendix E.4] give an example where, even though constructing a consistent setting is possible (by the choice of the score subspace F), it might still be beneficial to use only a level-η consistent setting because of the exponentially faster growth of the calibration function. The main contribution of this paper is a lower bound on the calibration function (Theorem 3), which is non-zero for η > 0 and thus can be used to obtain convergence rates in inconsistent settings.

Optimization and learning guarantees; normalizing the calibration function. Osokin et al. [14] note that the scale of the calibration function is not defined, i.e., if one multiplies the surrogate loss by some positive constant, the calibration function is multiplied by the same constant as well. One way to define a "natural normalization" is to use a scale-invariant convergence rate of a stochastic optimization algorithm. Osokin et al. [14, Section 3.3] applied the classical online ASGD [13] (under the well-specification assumption) and obtained a sample complexity result (which is, at the same time, the convergence rate of ASGD) saying that N∗ steps of ASGD are sufficient to get ε-accuracy on the task loss (in expectation), where N∗ is computed as follows:

N∗ := 4D²M² / Ȟ²_{Φ,L,F}(ε).    (10)

6 Note that Osokin et al. [14] define the excess risks by taking both infima w.r.t. the set of allowed scores F, which is subtly different from us. The results of the two setups are equivalent in the case of consistent surrogates, which are the main focus of Osokin et al.
[14], but they can be different in inconsistent cases.

Here the quantity N∗ depends on a convex lower bound Ȟ_{Φ,L,F}(ε) on the calibration function H_{Φ,L,F}(ε) and on the constants D, M, which appear in the convergence rate of ASGD: D is an upper bound on the norm of an optimal solution and M² is an upper bound on the expected squared norm of the stochastic gradient. Osokin et al. [14] show how to bound the constant DM for a very specific quadratic surrogate defined below (see Section 3.1).

3.1 Bounds for the Quadratic Surrogate

The major complication in applying and interpreting the theoretical results presented in Section 3 is the complexity of computing the calibration function. Osokin et al. [14] analyzed the calibration function only for the quadratic surrogate

Φ_quad(f, y) := (1/2k) ‖f + L(:, y)‖₂² = (1/2k) Σ_{ŷ ∈ Ŷ} (f_ŷ² + 2 f_ŷ L(ŷ, y) + L(ŷ, y)²).    (11)

For any task loss L, this surrogate is consistent whenever the subspace of allowed scores is rich enough, i.e., the subspace of scores F fully contains span(L). To connect with optimization, we assume a parametrization of the subspace F as the span of the columns of some matrix F, i.e., F = span(F) = {f = Fθ | θ ∈ R^r}.7 In the interesting settings, the dimension r is much smaller than both k and m. Note that to compute the gradient of the objective (11) w.r.t. the parameters θ, one needs to compute the matrix products FᵀF ∈ R^{r×r} and FᵀL(:, y) ∈ R^r, which are usually both of feasible sizes, but require an exponentially large sum (k summands) inside. Computing these quantities can be seen as a form of inference required to run the learning process.

Osokin et al.
[14] proved a lower bound on the calibration function for the quadratic surrogate (11) [14, Theorem 7], which we now present to contrast with our result presented in Section 4. When the subspace of scores F contains span(L), i.e., span(L) ⊆ F, implying that the setting is consistent, the calibration function is bounded from below by min_{i≠j} ε² / (2k ‖P_F Δ_ij‖₂²), where P_F is the orthogonal projection on the subspace F and Δ_ij := e_i − e_j ∈ R^k, with e_c being the c-th basis vector of the standard basis in R^k. They also showed that for some very structured losses (Hamming and block 0-1 losses), the quantity k ‖P_F Δ_ij‖₂² is not exponentially large, and thus the calibration function suggests that efficient learning is possible. One interesting case not studied by Osokin et al. [14] is the situation where the subspace of scores F does not fully contain the subspace span(L). In this case, the surrogate might not be consistent but can still lead to effective and efficient practical algorithms.

Normalizing the calibration function. The normalization constant DM appearing in (10) can also be computed for the quadratic surrogate (11) under the assumption of well-specification (see [14, Appendix F] for details). In particular, we have DM = L²_max ξ(κ(F) √r R Q_max), with ξ(z) = z² + z, where L_max denotes the maximal value of all elements of L, κ(F) is the condition number of the matrix F, and r is an upper bound on the rank of F. The constants R and Q_max come from the kernel ASGD setup and, importantly, depend only on the data distribution, not on the loss L or the score matrix F. Note that for a given subspace F, the choice of the matrix F is arbitrary, and it can always be chosen as an orthonormal basis of F, giving a κ(F) of one. However, such an F can lead to inefficient prediction (1), which makes the whole framework less appealing.
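As an aside, the gradient computation mentioned in Section 3.1 is easy to write down once FᵀF and FᵀL(:, y) are available; a minimal sketch (illustrative names, dense matrices, whereas in the settings of interest these products would be computed by structured inference rather than by materializing F):

```python
import numpy as np

def quad_surrogate_grad(F, L, theta, y):
    """Value and theta-gradient of Phi_quad(F @ theta, y) from (11).

    The gradient (F^T F theta + F^T L(:, y)) / k only needs the products
    F^T F (r x r) and F^T L(:, y) (r,), i.e., the 'inference' quantities
    discussed in Section 3.1."""
    k = F.shape[0]
    f = F @ theta
    value = np.sum((f + L[:, y]) ** 2) / (2 * k)
    grad = (F.T @ F @ theta + F.T @ L[:, y]) / k
    return value, grad
```

A finite-difference check on a random small instance confirms the gradient formula.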
Another important observation coming from the value of DM is the justification of the 1/k scaling in front of the surrogate (11).

4 Calibration Function for Inconsistent Surrogates

Our main result generalizes Theorem 7 of [14] to the case of inconsistent surrogates (the key difference is the absence of the assumption span(L) ⊆ F).

Theorem 3 (Lower bound on the calibration function H_{Φquad,L,F}(ε)). For any task loss L, its quadratic surrogate Φ_quad, and a score subspace F, the calibration function is bounded from below:

H_{Φquad,L,F}(ε) ≥ min_{i≠j} max_{v≥0} (εv − ξ_ij(v))₊² / (2k ‖P_F Δ_ij‖₂²),  where  ξ_ij(v) := ‖Lᵀ(v I_k − P_F) Δ_ij‖_∞,    (12)

where P_F is the orthogonal projection on the subspace F, (x)₊² := [x > 0] x² is the truncation of the parabola to its right branch, and Δ_ij := e_i − e_j ∈ R^k, with e_c ∈ R^k being the c-th column of the identity matrix I_k. By convention, if both the numerator and the denominator of (12) equal zero, the whole bound equals zero. If only the denominator equals zero, then the whole bound equals infinity (the particular pair of i and j is effectively not considered).

7 We do a pointwise analysis in this section, so we are not modeling the dependence of θ on the features x. However, in an actual implementation, the vector θ should be a function of the features x coming from some flexible family such as an RKHS or a neural network.

The proof of Theorem 3 starts by using the idea of [14] to compute the calibration function by solving a collection of convex quadratic programs (QPs).
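Before turning to the proof, note that the right-hand side of (12) is straightforward to evaluate numerically for small k; a sketch with illustrative names, using a pseudoinverse to build the projector and a finite grid over v (a grid can only weaken the bound, which therefore stays valid):

```python
import numpy as np

def calibration_lower_bound(L, F, eps, v_grid=np.linspace(0.0, 3.0, 301)):
    """Evaluate the lower bound (12) on the calibration function.

    For each pair i != j: xi_ij(v) = ||L^T (v I - P_F) Delta_ij||_inf, and the
    pair contributes max over v of (eps*v - xi_ij(v))_+^2 / (2k ||P_F Delta_ij||^2).
    Per the stated convention, 0/0 pairs contribute 0 and x/0 pairs +inf."""
    k = L.shape[0]
    P = F @ np.linalg.pinv(F)  # orthogonal projector onto span(F)
    best = np.inf
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            d = np.zeros(k)
            d[i], d[j] = 1.0, -1.0
            denom = 2 * k * float(np.sum((P @ d) ** 2))
            vals = []
            for v in v_grid:
                xi = np.max(np.abs(L.T @ (v * d - P @ d)))
                num = max(eps * v - xi, 0.0) ** 2
                if denom > 1e-12:
                    vals.append(num / denom)
                else:
                    vals.append(0.0 if num <= 1e-12 else np.inf)
            best = min(best, max(vals))
    return best
```

In a consistent toy case (0-1 loss on k = 2 labels with F = I), the maximum over v is attained at v = 1, matching the ε²/(2k‖P_F Δ_ij‖₂²) value of Theorem 7 of [14].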
Then we diverge from the proof of [14] (because it leads to a non-informative bound in inconsistent settings). For each of the formulated QPs, we construct a dual using the approach of Dorn [10]. The dual of Dorn is convenient for our needs because it does not require inverting the matrix defining the quadratic terms (in contrast to the standard Lagrangian dual). The complete proof is given in Appendix B.

Remark 4. The numerator of the bound (12) explicitly specifies the point at which the bound becomes non-zero, implying level-η consistency with η = ξ_ij(v)/v for the values of i, j, v that are active for a particular ε. The quantity v² / (2k ‖P_F Δ_ij‖₂²) bounds the weight of the ε² term in the calibration function after it leaves zero. Varying the quantity v defines the trade-off between the slope, which is related to the convergence speed of the algorithm, and the value of η, which defines the best achievable accuracy.

Remark 5. If the conditions of Theorem 7 of [14] are satisfied, i.e., span(L) ⊆ F, then the vector Lᵀ(I_k − P_F)Δ_ij equals zero and ξ_ij(v) becomes |v − 1| ‖LᵀΔ_ij‖_∞, which equals zero when v = 1. It might seem that having v > 1 can potentially give us a tighter lower bound than Theorem 7 of [14] even in consistent cases. However, the quantity ‖LᵀΔ_ij‖_∞ upper bounds the maximal possible (w.r.t. the conditional distribution P_D(· | x)) value of the excess task loss for a fixed pair i, j, leading to the identity vε − |v − 1| ‖LᵀΔ_ij‖_∞ = ‖LᵀΔ_ij‖_∞ for ε = ‖LᵀΔ_ij‖_∞ and v ≥ 1. Together with the convexity of the function (x)₊², this implies that the best possible value of v in consistent settings equals one.

Remark 6. Setting v in (12) to any non-negative constant gives a valid lower bound.
In particular, setting v to 1 (while potentially making the bound less tight) highlights the separation between the weight of the quadratic term and the best achievable accuracy η. The bound now reads as follows:

H_{Φquad,L,F}(ε) ≥ min_{i≠j} (ε − ξ_ij)₊² / (2k ‖P_F Δ_ij‖₂²),  where  ξ_ij := ‖Lᵀ(I_k − P_F) Δ_ij‖_∞.    (13)

Note that the weight of the ε² term now equals the corresponding coefficient of the bound of Theorem 7 of [14]. Notably, this weight depends only on the score subspace F, not on the loss L.

5 Bounds for Particular Losses

5.1 Multi-Class Classification with the Tree-Structured Loss

As an illustration of the obtained lower bound (12), we consider the task of multi-class classification with the tree-structured loss, which is defined for a weighted tree built on the labels (such trees on labels often appear in settings with a large number of labels, e.g., extreme classification [6]). Leaves in the tree correspond to the class labels ŷ ∈ Ŷ = Y, and the loss function is defined as the length of the path ρ between the leaves, i.e., L_tree(y, ŷ) := ρ(y, ŷ). To compute the lower bound exactly, we assume that the number of children d_s and the weights α_s/2 of the edges connecting a node with its children are equal for all the nodes of the same depth level s = 0, . . . , D − 1 (see Figure 2 in Appendix C for an example of such a tree), and that Σ_{s=0}^{D−1} α_s = 1, which normalizes L_max to one.

To define the score matrix F_tree,s0, we set the consistency depth s0 ∈ {1, . . . , D} and restrict the scores f to be equal for the groups (blocks) of leaves that have the same ancestor at the level s0. Let B(i) be the set of leaves that have the same ancestor as a leaf i at the depth s0.
With this notation, we have F_tree,s0 = span{Σ_{i∈B(j)} e_i | j = 1, . . . , k}. Theorem 3 gives us the bound (see Appendix C):

H_{Φquad,Ltree,Ftree,s0}(ε) ≥ [ε > η_{s0}] · ((η_{s0} − ρ̄_{s0} + α_{s0−1})² / (η_{s0}/2 + α_{s0−1})²) · (ε − η_{s0}/2)₊² / (4 b_{s0}),    (14)

where b_{s0}, ρ̄_{s0} := (1/|B(j)|) Σ_{i∈B(j)} ρ(i, j) = Σ_{s=s0}^{D−1} ((Π_{s'=s0}^{s} d_{s'} − 1)/Π_{s'=s0}^{s} d_{s'}) α_s, and η_{s0} := max_{i∈B(j)} ρ(i, j) = Σ_{s=s0}^{D−1} α_s are the number of blocks, the average distance within a block, and the maximal distance within a block, respectively.

Now we discuss the behavior of the bound (14) when changing the truncation level s0. With the growth of s0, the level of consistency η_{s0} goes to 0, indicating that more labels can be distinguished. At the same time, we have η_{s0}/2 ≤ ρ̄_{s0} for the trees we consider, and thus the coefficient in front of the ε² term can be bounded from above by 1/(4 b_{s0}), which means that the lower bound on the calibration function decreases at an exponential rate with the growth of s0. These arguments show the trade-off between the level of consistency and the coefficient of ε² in the calibration function.

Finally, note that the mixture of 0-1 and block 0-1 losses considered in [14, Appendix E.4] is an instance of the tree-structured loss with D = 2. Their bound [14, Proposition 17] matches (14) up to the difference in the definition of the calibration function (they do not have the [ε > η_{s0}] multiplier because they do not consider pairs of labels that fall in the same block).

5.2 Mean Average Precision (mAP) Loss for Ranking

The mAP loss, which is a popular way of measuring the quality of ranking, has attracted significant attention from the consistency point of view [4, 5, 18].
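Before discussing the mAP setting, the tree-structured construction of Section 5.1 can be made concrete; the sketch below builds L_tree for a balanced tree and the block score matrix F_tree,s0 (the tuple encoding of leaves and all names are our own, illustrative choices):

```python
import numpy as np
from itertools import product

def tree_loss_matrix(branching, alpha):
    """L_tree for a balanced tree: a leaf is a tuple of child indices;
    two leaves whose paths first diverge at depth s are at tree distance
    sum(alpha[s:]); sum(alpha) = 1 normalizes L_max to one."""
    leaves = list(product(*[range(d) for d in branching]))
    k = len(leaves)
    L = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            if a != b:
                s = next(t for t in range(len(branching))
                         if leaves[a][t] != leaves[b][t])
                L[a, b] = sum(alpha[s:])
    return L, leaves

def block_score_matrix(leaves, s0):
    """F_tree,s0: indicator columns of the blocks of leaves sharing an
    ancestor at depth s0, so scores are constant within each block."""
    blocks = sorted({leaf[:s0] for leaf in leaves})
    F = np.zeros((len(leaves), len(blocks)))
    for i, leaf in enumerate(leaves):
        F[i, blocks.index(leaf[:s0])] = 1.0
    return F
```

For branching (2, 2) and weights alpha = (0.6, 0.4), this is an instance of the D = 2 mixture of the 0-1 and block 0-1 losses mentioned above, with the level of inconsistency η_{s0=1} = 0.4 for the block score subspace.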
In the mAP setting, the ground-truth labels\nare binary vectors y \u2208 Y = {0, 1}r that indicate the items relevant for the query (a subset of r\nitems-to-rank) and the prediction consists in producing a permutation of items \u03c3 \u2208 \u02c6Y, \u02c6Y = Sr. The\nmAP loss is based on averaging the precision at different levels of recall and is de\ufb01ned as follows:\n\nLmAP(\u03c3, y) := 1 \u2212 1|y|\n\n1\n\n\u03c3(p)\n\nr(cid:88)\n\n\u03c3(p)(cid:88)\n\ny\u03c3\u22121(q) = 1 \u2212 r(cid:88)\n\np(cid:88)\n\n1\n\nmax(\u03c3(p),\u03c3(q))\n\nypyq\n|y| ,\n\n(15)\n\n|y| := (cid:80)r\n\np:yp=1\n\nq=1\n\np=1\n\nq=1\n\n1\n\n2 r(r + 1).8 The matrix FmAP \u2208 Rr!\u00d7 1\n\nwhere \u03c3(p) is the position of an item p in a permutation \u03c3, \u03c3\u22121 is the inverse permutation and\np=1 yp. The second identity provides a convenient form of writing the mAP loss [18]\n2 r(r+1)\nshowing that the loss matrix LmAP is of rank at most 1\nmax(\u03c3(p),\u03c3(q)) is a natural candidate to de\ufb01ne the score subspace F to\nsuch that (FmAP)\u03c3,pq :=\nget the consistent setting with the quadratic surrogate (11) (Eq. (15) implies that span(LmAP) =\nspan(FmAP)).\nHowever, as noted in Section 6 of [18], although the matrix FmAP is convenient from the consistency\npoint of view (in the setup of [18]), it leads to the prediction problem max\u03c3\u2208Sr (FmAP\u03b8)\u03c3, which is a\nquadratic assignment problem (QAP), and most QAPs are NP-hard.\nTo be able to predict ef\ufb01ciently, it would be bene\ufb01cial to have the matrix F with r columns such\nthat sorting the r-dimensional \u03b8 would give the desired permutation. It appears that it is possible to\nconstruct such a matrix by selecting a subset of columns of matrix FmAP. We de\ufb01ne Fsort \u2208 Rr!\u00d7r by\n\u03c3(p). 
A solution of the prediction problem max_{σ∈S_r} (F_sort θ)_σ is simply a permutation that sorts the elements of θ ∈ R^r in decreasing order (this statement follows from the fact that we can always increase the score (F_sort θ)_σ = Σ_{p=1}^r θ_p/σ(p) by swapping a pair of non-aligned items).

Most importantly for our study, the columns of the matrix F_sort are a subset of the columns of the matrix F_mAP, which indicates that learning with the convenient matrix F_sort might be sufficient for the mAP loss. In what follows, we study the calibration functions for the loss matrix L_mAP and score matrices F_mAP and F_sort. In Figure 1a-b, we plot the calibration functions for both F_mAP and F_sort and the lower bounds given by Theorem 3. All the curves were obtained for r = 5 (computing the exact values of the calibration functions is exponential in r).

Next, we study the behavior of the lower bound (12) for large values of r. In Lemma 13 of Appendix D, we show that the denominator of the bound (12) is not exponential in r (we have 2r!‖P_{F_sort} Δ_{πω}‖₂² = O(r)). We also know that ‖P_{F_sort} Δ_{πω}‖₂² ≤ ‖P_{F_mAP} Δ_{πω}‖₂² (because F_sort is a subspace of F_mAP), which implies that the calibration function of the consistent setting grows not faster than the one of the inconsistent setting. We can also numerically compute a lower bound on the point η until which the calibration function is guaranteed to be zero (for this we simply pick two permutations π, ω and a labeling y that deliver large values of (L_mAP^T (I_k − P_{F_sort}) Δ_{πω})_y ≤ ξ_{π,ω}(1)). Figure 1c shows that the level of inconsistency η grows with the growth of r, which makes the method less appealing for large-scale settings.

Finally, note that to run the ASGD algorithm for the quadratic surrogate (11), mAP loss and score matrix F_sort, we need to efficiently compute F_sort^T F_sort and F_sort^T L_mAP(:, y).
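The claim above, that sorting θ solves the prediction problem for F_sort, is easy to verify by brute force for small r. A minimal sketch under our own naming conventions (positions are 1-based, as in the definition of σ):

```python
import itertools

def fsort_score(theta, sigma):
    # (F_sort theta)_sigma = sum_p theta[p] / sigma[p],
    # where sigma[p] is the 1-based position of item p
    return sum(t / s for t, s in zip(theta, sigma))

def sort_prediction(theta):
    """Permutation placing the largest theta at position 1,
    i.e. sorting theta in decreasing order."""
    order = sorted(range(len(theta)), key=lambda p: -theta[p])
    sigma = [0] * len(theta)
    for pos, p in enumerate(order, start=1):
        sigma[p] = pos
    return tuple(sigma)

# brute-force argmax over all of S_4 for a fixed score vector
theta = [0.3, 1.2, -0.5, 0.7]
best = max(itertools.permutations(range(1, 5)), key=lambda s: fsort_score(theta, s))
```

Here `best` coincides with `sort_prediction(theta)` (a consequence of the rearrangement inequality), which is why prediction with F_sort is an O(r log r) sort rather than a QAP.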
Lemmas 11 and 12 (see Appendix D) provide algorithms that are linear in r for computing F_sort^T F_sort and F_sort^T L_mAP(:, y). The condition number of F_sort grows as Θ(log r), keeping the sample complexity bound (10) well behaved.

⁸Ramaswamy & Agarwal [17, Proposition 21] showed that the rank of L_mAP is at least r(r+1)/2 − 2.

(a): Consistent surrogate with F_mAP    (b): Inconsistent surrogate with F_sort    (c): LB on η

Figure 1: Plot (a) shows the calibration function H_{Φquad,L_mAP,F_mAP}(ε) for L_mAP (red line) obtained numerically. The solid blue line [14, Theorem 7] is its lower bound, LB, and the solid black line is the worst-case bound obtained for F = I_{r!} (which means not constructing an appropriate low-dimensional F). The difference between the blue and the black lines is exponential (proportional to r!). The dashed blue line illustrates the inconsistent surrogate (note that it is zero for small ε > 0, but then grows faster than the solid blue line, i.e., the consistent setting). Plot (b) shows the calibration function H_{Φquad,L_mAP,F_sort}(ε) (red line) obtained numerically (this setting is level-η consistent for η ≈ 0.08). The blue line (Theorem 3) is its lower bound for the optimal value of v and the green line is the bound for v = 1 (easier to obtain). The black line shows the zero-valued trivial bound from [14]. The dashed blue line shows H_{Φquad,L_mAP,F_mAP}(ε) for the consistent surrogate to compare the two settings. Note that in both plots (a) and (b) the solid blue lines are the lower bounds of the corresponding calibration functions (red lines), but the dashed blue lines are not (shown for comparison purposes).
Plot (c) shows a lower bound on the point η where the exact calibration function H_{Φquad,L_mAP,F_sort}(ε) stops being zero, indicating the level of consistency (Definition 2).

6 Discussion

Related works. Despite a large number of works studying consistency and calibration in the context of machine learning, there have been relatively few attempts to obtain guarantees for inconsistent surrogates. The most popular approach is to study consistency under so-called low-noise conditions. Such works show that under certain assumptions on the data-generating distribution D (usually these assumptions are on the conditional distribution of labels and are impossible to verify for real data) the surrogate of interest becomes consistent, while being inconsistent for general D. Duchi et al. [11] established such a result for the value-regularized linear surrogate for ranking (which resembles the pairwise disagreement, PD, loss). Ramaswamy et al. [18] provided similar results for the mAP and PD losses for ranking and their quadratic surrogate. Similarly to our conclusions, the mAP surrogate of [18] is consistent with r(r+1)/2 parameters learned and only low-noise consistent with r parameters learned. Long & Servedio [12] introduced a notion of realizable consistency w.r.t. a function class (they considered linear predictors), which is consistency w.r.t. the function class assuming a data distribution such that labels depend on features deterministically, with this dependency being in the correct function class. Ben-David et al. [3] worked in the agnostic setting for binary classification (no assumptions on the underlying D) and provided guarantees on the error of linear predictors when the margin was bounded by some constant (their work reduces to consistency in the limit case, but is more general).

Conclusion.
Differently from the previous approaches, we do not put constraints on the data-generating distribution, but instead study the connection between the surrogate and task losses by means of the calibration function (following [14]), which represents the worst-case scenario. For the quadratic surrogate (11), we can bound the calibration function from below in such a way that the bound is non-trivial in inconsistent settings (differently from [14]). Our bound quantifies the level of inconsistency of a setting (defined by the surrogate loss, the task loss and the parametrization of the scores used) and allows us to analyze when learning with inconsistent surrogates can be beneficial. We illustrate the behavior of our bound for two tasks (multi-class classification and ranking) and show examples of conclusions that our approach can give.

Future work. It would be interesting to combine our quantitative analysis with constraints on the data distribution, which might give adaptive calibration functions (in analogy to adaptive convergence rates in convex optimization: for example, SAGA [9] has a linear convergence rate for strongly convex objectives and a 1/t rate for non-strongly convex ones), and with the recent results of Pillaud-Vivien et al. [16] showing that under some low-noise assumptions even slow convergence of the surrogate objective can imply exponentially fast convergence of the task loss.

Acknowledgements

This work was partly supported by Samsung Research, by Samsung Electronics, by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001) and by the NSERC Discovery Grant RGPIN-2017-06936.

References

[1] Ávila Pires, Bernardo, Ghavamzadeh, Mohammad, and Szepesvári, Csaba. Cost-sensitive multiclass classification risk bounds. In ICML, 2013.

[2] Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D.
Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[3] Ben-David, Shai, Loker, David, Srebro, Nathan, and Sridharan, Karthik. Minimizing the misclassification error rate using a surrogate convex loss. 2012.

[4] Buffoni, David, Gallinari, Patrick, Usunier, Nicolas, and Calauzènes, Clément. Learning scoring functions with order-preserving losses and standardized supervision. In ICML, 2011.

[5] Calauzènes, Clément, Usunier, Nicolas, and Gallinari, Patrick. On the (non-)existence of convex, calibrated surrogate losses for ranking. In NIPS, 2012.

[6] Choromanska, Anna, Agarwal, Alekh, and Langford, John. Extreme multi class classification. In NIPS Workshop: eXtreme Classification, 2013.

[7] Ciliberto, Carlo, Rudi, Alessandro, and Rosasco, Lorenzo. A consistent regularization approach for structured prediction. In NIPS, 2016.

[8] Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.

[9] Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.

[10] Dorn, William S. Duality in quadratic programming. Quarterly of Applied Mathematics, 18(2):155–162, 1960.

[11] Duchi, John C., Mackey, Lester W., and Jordan, Michael I. On the consistency of ranking algorithms. In ICML, 2010.

[12] Long, Phil and Servedio, Rocco. Consistency versus realizable H-consistency for multiclass classification. In ICML, 2013.

[13] Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[14] Osokin, Anton, Bach, Francis, and Lacoste-Julien, Simon.
On structured prediction theory with calibrated convex surrogate losses. In NIPS, 2017.

[15] Pedregosa, Fabian, Bach, Francis, and Gramfort, Alexandre. On the consistency of ordinal regression methods. Journal of Machine Learning Research (JMLR), 18(55):1–35, 2017.

[16] Pillaud-Vivien, Loucas, Rudi, Alessandro, and Bach, Francis. Exponential convergence of testing error for stochastic gradient methods. In COLT, 2018.

[17] Ramaswamy, Harish G. and Agarwal, Shivani. Convex calibration dimension for multiclass loss matrices. Journal of Machine Learning Research (JMLR), 17(14):1–45, 2016.

[18] Ramaswamy, Harish G., Agarwal, Shivani, and Tewari, Ambuj. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. In NIPS, 2013.

[19] Steinwart, Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

[20] Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2003.

[21] Tewari, Ambuj and Bartlett, Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research (JMLR), 8:1007–1025, 2007.

[22] Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, 2005.

[23] Zhang, Tong. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research (JMLR), 5:1225–1251, 2004.

[24] Zhang, Tong. Statistical behavior and consistency of classification methods based on convex risk minimization.
Annals of Statistics, 32(1):56–134, 2004.