{"title": "On Structured Prediction Theory with Calibrated Convex Surrogate Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 302, "page_last": 313, "abstract": "We provide novel theoretical insights on structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent and we prove tight bounds on the so-called \"calibration function\" relating the excess surrogate risk to the actual risk. In contrast to prior related work, we carefully monitor the effect of the exponential number of classes in the learning guarantees as well as on the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for structured prediction.", "full_text": "On Structured Prediction Theory with Calibrated\n\nConvex Surrogate Losses\n\nAnton Osokin\n\nINRIA/ENS\u2217, Paris, France\n\nHSE\u2020, Moscow, Russia\n\nFrancis Bach\n\nINRIA/ENS\u2217, Paris, France\n\nAbstract\n\nSimon Lacoste-Julien\n\nMILA and DIRO\n\nUniversit\u00e9 de Montr\u00e9al, Canada\n\nWe provide novel theoretical insights on structured prediction in the context of\nef\ufb01cient convex surrogate loss minimization with consistency guarantees. For any\ntask loss, we construct a convex surrogate that can be optimized via stochastic\ngradient descent and we prove tight bounds on the so-called \u201ccalibration function\u201d\nrelating the excess surrogate risk to the actual risk. In contrast to prior related\nwork, we carefully monitor the effect of the exponential number of classes in the\nlearning guarantees as well as on the optimization complexity. 
As an interesting\nconsequence, we formalize the intuition that some task losses make learning harder\nthan others, and that the classical 0-1 loss is ill-suited for structured prediction.\n\n1\n\nIntroduction\n\nStructured prediction is a sub\ufb01eld of machine learning aiming at making multiple interrelated\npredictions simultaneously. The desired outputs (labels) are typically organized in some structured\nobject such as a sequence, a graph, an image, etc. Tasks of this type appear in many practical domains\nsuch as computer vision [34], natural language processing [42] and bioinformatics [19].\nThe structured prediction setup has at least two typical properties differentiating it from the classical\nbinary classi\ufb01cation problems extensively studied in learning theory:\n1. Exponential number of classes: this brings both additional computational and statistical challenges.\nBy exponential, we mean exponentially large in the size of the natural dimension of output, e.g., the\nnumber of all possible sequences is exponential w.r.t. the sequence length.\n2. Cost-sensitive learning: in typical applications, prediction mistakes are not all equally costly.\nThe prediction error is usually measured with a highly-structured task-speci\ufb01c loss function, e.g.,\nHamming distance between sequences of multi-label variables or mean average precision for ranking.\nDespite many algorithmic advances to tackle structured prediction problems [4, 35], there have been\nrelatively few papers devoted to its theoretical understanding. Notable recent exceptions that made\nsigni\ufb01cant progress include Cortes et al. [13] and London et al. [28] (see references therein) which\nproposed data-dependent generalization error bounds in terms of popular empirical convex surrogate\nlosses such as the structured hinge loss [44, 45, 47]. 
A question not addressed by these works is\nwhether their algorithms are consistent: does minimizing their convex bounds with in\ufb01nite data lead\nto the minimization of the task loss as well? Alternatively, the structured probit and ramp losses are\nconsistent [31, 30], but non-convex and thus it is hard to obtain computational guarantees for them.\nIn this paper, we aim at getting the property of consistency for surrogate losses that can be ef\ufb01ciently\nminimized with guarantees, and thus we consider convex surrogate losses.\nThe consistency of convex surrogates is well understood in the case of binary classi\ufb01cation [50, 5, 43]\nand there is signi\ufb01cant progress in the case of multi-class 0-1 loss [49, 46] and general multi-\n\n\u2217DI \u00c9cole normale sup\u00e9rieure, CNRS, PSL Research University\n\u2020National Research University Higher School of Economics\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fclass loss functions [3, 39, 48]. A large body of work speci\ufb01cally focuses on the related tasks of\nranking [18, 9, 40] and ordinal regression [37].\nContributions. In this paper, we study consistent convex surrogate losses speci\ufb01cally in the context\nof an exponential number of classes. We argue that even while being consistent, a convex surrogate\nmight not allow ef\ufb01cient learning. As a concrete example, Ciliberto et al. [10] recently proposed a\nconsistent approach to structured prediction, but the constant in their generalization error bound can\nbe exponentially large as we explain in Section 5. 
There are two possible sources of dif\ufb01culties from\nthe optimization perspective: to reach adequate accuracy on the task loss, one might need to optimize\na surrogate loss to exponentially small accuracy; or to reach adequate accuracy on the surrogate loss,\none might need an exponential number of algorithm steps because of exponentially large constants\nin the convergence rate. We propose a theoretical framework that jointly tackles these two aspects\nand allows to judge the feasibility of ef\ufb01cient learning. In particular, we construct a calibration\nfunction [43], i.e., a function setting the relationship between accuracy on the surrogate and task\nlosses, and normalize it by the means of convergence rate of an optimization algorithm.\nAiming for the simplest possible application of our framework, we propose a family of convex\nsurrogates that are consistent for any given task loss and can be optimized using stochastic gradient\ndescent. For a special case of our family (quadratic surrogate), we provide a complete analysis\nincluding general lower and upper bounds on the calibration function for any task loss, with exact\nvalues for the 0-1, block 0-1 and Hamming losses. We observe that to have a tractable learning\nalgorithm, one needs both a structured loss (not the 0-1 loss) and appropriate constraints on the\npredictor, e.g., in the form of linear constraints for the score vector functions. Our framework also\nindicates that in some cases it might be bene\ufb01cial to use non-consistent surrogates. In particular, a\nnon-consistent surrogate might allow optimization only up to speci\ufb01c accuracy, but exponentially\nfaster than a consistent one.\nWe introduce the structured prediction setting suitable for studying consistency in Sections 2 and 3.\nWe analyze the calibration function for the quadratic surrogate loss in Section 4. 
We review the related work in Section 5 and conclude in Section 6.\n\n2 Structured prediction setup\nIn structured prediction, the goal is to predict a structured output y \u2208 Y (such as a sequence, a graph, an image) given an input x \u2208 X . The quality of prediction is measured by a task-dependent loss function L( \u02c6y, y | x) \u2265 0 specifying the cost for predicting \u02c6y when the correct output is y. In this paper, we consider the case when the number of possible predictions and the number of possible labels are both finite. For simplicity,1 we also assume that the sets of possible predictions and correct outputs always coincide and do not depend on x. We refer to this set as the set of labels Y, denote its cardinality by k, and map its elements to 1, . . . , k. In this setting, assuming that the loss function depends only on \u02c6y and y, but not on x directly, the loss is defined by a loss matrix L \u2208 Rk\u00d7k. We assume that all the elements of the matrix L are non-negative and will use Lmax to denote the maximal element. Compared to multi-class classification, k is typically exponentially large in the size of the natural dimension of y; e.g., Y contains all possible sequences of symbols from a finite alphabet.\nFollowing standard practices in structured prediction [12, 44], we define the prediction model by a score function f : X \u2192 Rk specifying a score fy(x) for each possible output y \u2208 Y. The final prediction is done by selecting a label with the maximal value of the score\n\npred(f(x)) := argmax \u02c6y\u2208Y f \u02c6y(x),    (1)\n\nwith some fixed strategy to resolve ties. 
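As a concrete illustration (ours, not part of the paper), the prediction rule (1) can be sketched in a few lines of Python; `numpy.argmax` returns the first maximizer, which matches the smallest-index tie-breaking convention adopted in the text.

```python
import numpy as np

def pred(scores):
    """Prediction rule (1): return a label with maximal score.

    np.argmax returns the first (smallest-index) maximizer, which
    implements the fixed tie-breaking strategy assumed in the paper.
    """
    return int(np.argmax(scores))

# toy example with k = 4 labels (illustrative values only)
f = np.array([0.2, 0.9, 0.9, -1.0])
label = pred(f)  # ties between labels 1 and 2 resolve to 1
```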
To simplify the analysis, we assume that among the labels\nwith maximal scores, the predictor always picks the one with the smallest index.\nThe goal of prediction-based machine learning consists in \ufb01nding a predictor that works well on\nthe unseen test set, i.e., data points coming from the same distribution D as the one generating the\ntraining data. One way to formalize this is to minimize the generalization error, often referred to as\nthe actual (or population) risk based on the loss L,\n\nRL(f) := IE(x,y)\u223cD L(cid:0)pred(f(x)), y(cid:1).\n\n(2)\nMinimizing the actual risk (2) is usually hard. The standard approach is to minimize a surrogate risk,\nwhich is a different objective easier to optimize, e.g., convex. We de\ufb01ne a surrogate loss as a function\n1Our analysis is generalizable to rectangular losses, e.g., ranking losses studied by Ramaswamy et al. [40].\n\n2\n\n\fR\u03a6(f) := IE(x,y)\u223cD \u03a6(f(x), y),\n\n\u03a6 : Rk \u00d7 Y \u2192 R depending on a score vector f = f(x) \u2208 Rk and a target label y \u2208 Y as input\narguments. We denote the y-th component of f with fy. The surrogate risk (the \u03a6-risk) is de\ufb01ned as\n(3)\nwhere the expectation is taken w.r.t. the data-generating distribution D. 
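For intuition, the actual risk (2) and a surrogate risk (3) can be evaluated exactly when the distribution D is a small finite table; the sketch below is our own illustration with made-up numbers, using the 0-1 loss and a generic surrogate passed in as a function.

```python
import numpy as np

def actual_risk(score_fn, D, L):
    """R_L(f) = E_{(x,y)~D} L(pred(f(x)), y) for a finite distribution.

    D is a list of (probability, x, y) triples; L is the k x k loss matrix.
    np.argmax implements the smallest-index tie-breaking prediction rule.
    """
    return sum(p * L[int(np.argmax(score_fn(x))), y] for p, x, y in D)

def surrogate_risk(score_fn, D, phi):
    """R_Phi(f) = E_{(x,y)~D} Phi(f(x), y) for the same finite D."""
    return sum(p * phi(score_fn(x), y) for p, x, y in D)

k = 3
L01 = 1.0 - np.eye(k)                        # 0-1 loss matrix
D = [(0.5, 0, 0), (0.3, 1, 1), (0.2, 1, 2)]  # (prob, x, y) triples, made up
f = lambda x: np.array([1.0, 0.0, 0.0]) if x == 0 else np.array([0.0, 1.0, 0.0])
r = actual_risk(f, D, L01)  # only the (0.2, 1, 2) case is mispredicted
```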
To make the minimization of (3) well-defined, we always assume that the surrogate loss \u03a6 is bounded from below and continuous. Examples of common surrogate losses include the structured hinge loss [44, 47] \u03a6SSVM(f , y) := max \u02c6y\u2208Y (f \u02c6y + L( \u02c6y, y)) \u2212 fy, the log loss (maximum likelihood learning) used, e.g., in conditional random fields [25], \u03a6log(f , y) := log(\u2211 \u02c6y\u2208Y exp f \u02c6y) \u2212 fy, and their hybrids [38, 21, 22, 41].\nIn terms of task losses, we consider the unstructured 0-1 loss L01( \u02c6y, y) := [ \u02c6y \u2260 y],2 and the two following structured losses: the block 0-1 loss with b equal blocks of labels, L01,b( \u02c6y, y) := [ \u02c6y and y are not in the same block]; and the (normalized) Hamming loss between tuples of T binary variables yt, LHam,T ( \u02c6y, y) := (1/T ) \u2211T t=1 [\u02c6yt \u2260 yt]. To illustrate some aspects of our analysis, we also look at the mixed loss L01,b,\u03b7: a convex combination of the 0-1 and block 0-1 losses, defined as L01,b,\u03b7 := \u03b7L01 + (1 \u2212 \u03b7)L01,b for some \u03b7 \u2208 [0, 1].\n\n3 Consistency for structured prediction\n\n3.1 Calibration function\nWe now formalize the connection between the actual risk RL and the surrogate \u03a6-risk R\u03a6 via the so-called calibration function, see Definition 1 below [5, 49, 43, 18, 3]. As is standard for this kind of analysis, the setup is non-parametric, i.e., it does not take into account the dependency of scores on input variables x. For now, we assume that a family of score functions FF consists of all vector-valued Borel measurable functions f : X \u2192 F where F \u2286 Rk is a subspace of allowed score vectors, which will play an important role in our analysis. This setting is equivalent to a pointwise analysis, i.e., looking at each input x independently. 
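Once the label set is enumerated, the task losses defined in Section 2 are all plain k x k matrices; a small sketch (our own illustration, with tiny b and T) makes the definitions concrete.

```python
import itertools
import numpy as np

def loss_01(k):
    """0-1 loss: L[i, j] = [i != j]."""
    return 1.0 - np.eye(k)

def loss_block01(k, b):
    """Block 0-1 loss with b equal blocks: 0 iff the labels share a block."""
    assert k % b == 0
    block = np.repeat(np.arange(b), k // b)   # block index of each label
    return (block[:, None] != block[None, :]).astype(float)

def loss_hamming(T):
    """Normalized Hamming loss on tuples of T binary variables (k = 2^T)."""
    labels = list(itertools.product([0, 1], repeat=T))
    return np.array([[sum(a != b for a, b in zip(u, v)) / T
                      for v in labels] for u in labels])

def loss_mixed(k, b, eta):
    """Mixed loss: eta * L01 + (1 - eta) * L01,b."""
    return eta * loss_01(k) + (1 - eta) * loss_block01(k, b)
```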
We bring the dependency on the input\nback into the analysis in Section 3.3 where we assume a speci\ufb01c family of score functions.\nLet DX represent the marginal distribution for D on x and IP(\u00b7 | x) denote its conditional given x.\nWe can now rewrite the risk RL and \u03a6-risk R\u03a6 as\n\nRL(f) = IEx\u223cDX (cid:96)(f(x), IP(\u00b7 | x)), R\u03a6(f) = IEx\u223cDX \u03c6(f(x), IP(\u00b7 | x)),\n\nwhere the conditional risk (cid:96) and the conditional \u03a6-risk \u03c6 depend on a vector of scores f and a\nconditional distribution on the set of output labels q as\n\n(cid:96)(f , q) :=\n\nqcL(pred(f ), c), \u03c6(f , q) :=\n\nqc\u03a6(f , c).\n\nc=1\n\nThe calibration function H\u03a6,L,F between the surrogate loss \u03a6 and the task loss L relates the excess\nsurrogate risk with the actual excess risk via the excess risk bound:\n\nH\u03a6,L,F (\u03b4(cid:96)(f , q)) \u2264 \u03b4\u03c6(f , q), \u2200f \u2208 F, \u2200q \u2208 \u2206k,\n\n(4)\nwhere \u03b4\u03c6(f , q) = \u03c6(f , q) \u2212 inf \u02c6f\u2208F \u03c6( \u02c6f , q), \u03b4(cid:96)(f , q) = (cid:96)(f , q) \u2212 inf \u02c6f\u2208F (cid:96)( \u02c6f , q) are the excess\nrisks and \u2206k denotes the probability simplex on k elements.\nIn other words, to \ufb01nd a vector f that yields an excess risk smaller than \u03b5, we need to optimize the\n\u03a6-risk up to H\u03a6,L,F (\u03b5) accuracy (in the worst case). We make this statement precise in Theorem 2\nbelow, and now proceed to the formal de\ufb01nition of the calibration function.\nDe\ufb01nition 1 (Calibration function). 
For a task loss L, a surrogate loss \u03a6, and a set of feasible scores F, the calibration function H\u03a6,L,F (\u03b5) (defined for \u03b5 \u2265 0) equals the infimum excess of the conditional surrogate risk when the excess of the conditional actual risk is at least \u03b5:\n\nH\u03a6,L,F (\u03b5) := inf f\u2208F , q\u2208\u2206k \u03b4\u03c6(f , q)    (5)\ns.t. \u03b4\u2113(f , q) \u2265 \u03b5.    (6)\n\nWe set H\u03a6,L,F (\u03b5) to +\u221e when the feasible set is empty.\nBy construction, H\u03a6,L,F is non-decreasing on [0, +\u221e), H\u03a6,L,F (\u03b5) \u2265 0, the inequality (4) holds, and H\u03a6,L,F (0) = 0. Note that H\u03a6,L,F can be non-convex and even non-continuous (see examples in Figure 1). Also, note that large values of H\u03a6,L,F (\u03b5) are better.\n\n2Here we use the Iverson bracket notation, i.e., [A] := 1 if a logical expression A is true, and zero otherwise.\n\n(a): Hamming loss LHam,T . (b): Mixed loss L01,b,0.4.\nFigure 1: Calibration functions for the quadratic surrogate \u03a6quad (12) defined in Section 4 and two different task losses. (a) \u2013 the calibration functions for the Hamming loss LHam,T when used without constraints on the scores, F = Rk (in red), and with the tight constraints implying consistency, F = span(LHam,T ) (in blue). The red curve can grow exponentially slower than the blue one. (b) \u2013 the calibration functions for the mixed loss L01,b,\u03b7 with \u03b7 = 0.4 (see Section 2 for the definition) when used without constraints on the scores (red) and with tight constraints for the block 0-1 loss (blue). The blue curve represents level-0.2 consistency. The calibration function equals zero for \u03b5 \u2264 \u03b7/2, but grows exponentially faster than the red curve representing a consistent approach and thus could be better for small \u03b7. 
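To make Definition 1 tangible, the following sketch (ours, not from the paper) evaluates the two excess risks for the 0-1 loss with k = 4 and the quadratic surrogate of Section 4, at the worst-case conditional distribution q_i = 1/2 + eps/2, q_j = 1/2 - eps/2 discussed there; the resulting excess surrogate risk equals eps^2/(4k), the value of the calibration function for the 0-1 loss.

```python
import numpy as np

def excess_risks(f, q, L, k):
    """delta_ell and delta_phi_quad at score vector f, distribution q.

    delta_phi uses the quadratic surrogate, for which the conditional
    excess is ||f + L q||^2 / (2k) when scores are unconstrained.
    """
    c_hat = int(np.argmax(f))                     # smallest-index tie-breaking
    d_ell = L[c_hat] @ q - min(L[c] @ q for c in range(k))
    d_phi = float(np.sum((f + L @ q) ** 2)) / (2 * k)
    return d_ell, d_phi

k, eps = 4, 0.3
L = 1.0 - np.eye(k)                               # 0-1 loss
q = np.array([0.5 - eps / 2, 0.5 + eps / 2, 0.0, 0.0])  # worst case (i=1, j=0)
f = -L @ q                                        # unconstrained minimizer of the surrogate
f[[0, 1]] = f[[0, 1]].mean()                      # force a tie, so label 0 is predicted
d_ell, d_phi = excess_risks(f, q, L, k)
# d_ell == eps while d_phi == eps**2 / (4 * k)
```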
More details on the calibration functions in this \ufb01gure are given in Section 4.\n\n3.2 Notion of consistency\n\nWe use the calibration function H\u03a6,L,F to set a connection between optimizing the surrogate and\ntask losses by Theorem 2, which is similar to Theorem 3 of Zhang [49].\nTheorem 2 (Calibration connection). Let H\u03a6,L,F be the calibration function between the surrogate\nloss \u03a6 and the task loss L with feasible set of scores F \u2286 Rk. Let \u02c7H\u03a6,L,F be a convex non-decreasing\nlower bound of the calibration function. Assume that \u03a6 is continuous and bounded from below. Then,\nfor any \u03b5 > 0 with \ufb01nite \u02c7H\u03a6,L,F (\u03b5) and any f \u2208 FF , we have\n\nR\u03a6(f) < R\u2217\n\n\u03a6,F + \u02c7H\u03a6,L,F (\u03b5) \u21d2 RL(f) < R\u2217\n\nL,F + \u03b5,\n\nwhere R\u2217\n\n\u03a6,F := inf f\u2208FF R\u03a6(f) and R\u2217\n\nL,F := inf f\u2208FF RL(f).\n\n(7)\n\n(8)\n\nProof. We take the expectation of (4) w.r.t. x, where the second argument of (cid:96) is set to the conditional\ndistribution IP(\u00b7 | x). Then, we apply Jensen\u2019s inequality (since \u02c7H\u03a6,L,F is convex) to get\n\n\u02c7H\u03a6,L,F (RL(f) \u2212 R\u2217\n\nL,F ) \u2264 R\u03a6(f) \u2212 R\u2217\n\n\u03a6,F < \u02c7H\u03a6,L,F (\u03b5),\n\nwhich implies (7) by monotonicity of \u02c7H\u03a6,L,F .\n\nA suitable convex non-decreasing lower bound \u02c7H\u03a6,L,F (\u03b5) required by Theorem 2 always exists, e.g.,\nthe zero constant. However, in this case Theorem 2 is not informative, because the l.h.s. of (7) is\nnever true. Zhang [49, Proposition 25] claims that \u02c7H\u03a6,L,F de\ufb01ned as the lower convex envelope of\nthe calibration function H\u03a6,L,F satis\ufb01es \u02c7H\u03a6,L,F (\u03b5) > 0, \u2200\u03b5 > 0, if H\u03a6,L,F (\u03b5) > 0, \u2200\u03b5 > 0, and,\ne.g., the set of labels is \ufb01nite. 
This statement implies that an informative \u02c7H\u03a6,L,F always exists and allows us to characterize consistency through properties of the calibration function H\u03a6,L,F .\nWe now define a notion of level-\u03b7 consistency, which is more general than consistency.\nDefinition 3 (level-\u03b7 consistency). A surrogate loss \u03a6 is consistent up to level \u03b7 \u2265 0 w.r.t. a task loss L and a set of scores F if and only if the calibration function satisfies H\u03a6,L,F (\u03b5) > 0 for all \u03b5 > \u03b7 and there exists \u02c6\u03b5 > \u03b7 such that H\u03a6,L,F (\u02c6\u03b5) is finite.\n\nLooking solely at (standard level-0) consistency vs. inconsistency might be too coarse to capture practical properties related to optimization accuracy (see, e.g., [29]). For example, if H\u03a6,L,F (\u03b5) = 0 only for very small values of \u03b5, then the method can still optimize the actual risk up to a certain level, which might be good enough in practice, especially if it means that it can be optimized faster. Examples of calibration functions for consistent and inconsistent surrogate losses are shown in Figure 1.\nOther notions of consistency. Definition 3 with \u03b7 = 0 and F = Rk results in the standard setting often appearing in the literature. In particular, in this case Theorem 2 implies Fisher consistency as formulated, e.g., by Pedregosa et al. [37] for general losses and Lin [27] for binary classification. This setting is also closely related to many definitions of consistency used in the literature. For example, for a continuous surrogate bounded from below, it is equivalent to infinite-sample consistency [49], classification calibration [46], edge-consistency [18], (L, Rk)-calibration [39], and prediction calibration [48]. See [49, Appendix A] for the detailed discussion.\nRole of F. 
Let the approximation error for the restricted set of scores F be de\ufb01ned as R\u2217\nL :=\ninf f\u2208FF RL(f) \u2212 inf f RL(f). For any conditional distribution q, the score vector f := \u2212Lq will\nyield an optimal prediction. Thus the condition span(L) \u2286 F is suf\ufb01cient for F to have zero\napproximation error for any distribution D, and for our 0-consistency condition to imply the standard\nFisher consistency with respect to L. In the following, we will see that a restricted F can both play a\nrole for computational ef\ufb01ciency as well as statistical ef\ufb01ciency (thus losses with smaller span(L)\nmight be easier to work with).\n\nL,F \u2212R\u2217\n\n3.3 Connection to optimization accuracy and statistical ef\ufb01ciency\n\nThe scale of a calibration function is not intrinsically well-de\ufb01ned: we could multiply the surrogate\nfunction by a scalar and it would multiply the calibration function by the same scalar, without\nchanging the optimization problem. Intuitively, we would like the surrogate loss to be of order 1. If\nwith this scale the calibration function is exponentially small (has a 1/k factor), then we have strong\nevidence that the stochastic optimization will be dif\ufb01cult (and thus learning will be slow).\nTo formalize this intuition, we add to the picture the complexity of optimizing the surrogate loss with\na stochastic approximation algorithm. By using a scale-invariant convergence rate, we provide a\nnatural normalization of the calibration function. The following two observations are central to the\ntheoretical insights provided in our work:\n1. Scale. For a properly scaled surrogate loss, the scale of the calibration function is a good indication\nof whether a stochastic approximation algorithm will take a large number of iterations (in the worst\ncase) to obtain guarantees of small excess of the actual risk (and vice-versa, a large coef\ufb01cient\nindicates a small number of iterations). 
The actual veri\ufb01cation requires computing the normalization\nquantities given in Theorem 6 below.\n2. Statistics. The bound on the number of iterations directly relates to the number of training\nexamples that would be needed to learn, if we see each iteration of the stochastic approximation\nalgorithm as using one training example to optimize the expected surrogate.\nTo analyze the statistical convergence of surrogate risk optimization, we have to specify the set of\nscore functions that we work with. We assume that the structure on input x \u2208 X is de\ufb01ned by a\npositive de\ufb01nite kernel K : X \u00d7 X \u2192 R. We denote the corresponding reproducing kernel Hilbert\nspace (RKHS) by H and its explicit feature map by \u03c8(x) \u2208 H. By the reproducing property, we\nhave (cid:104)f, \u03c8(x)(cid:105)H = f (x) for all x \u2208 X , f \u2208 H, where (cid:104)\u00b7,\u00b7(cid:105)H is the inner product in the RKHS. We\nde\ufb01ne the subspace of allowed scores F \u2286 Rk via the span of the columns of a matrix F \u2208 Rk\u00d7r.\nThe matrix F explicitly de\ufb01nes the structure of the score function. With this notation, we will assume\nthat the score function is of the form f(x) = F W \u03c8(x), where W : H \u2192 Rr is a linear operator\nto be learned (a matrix if H is of \ufb01nite dimension) that represents a collection of r elements in H,\ntransforming \u03c8(x) to a vector in Rr by applying the RKHS inner product r times.3 Note that for\nstructured losses, we usually have r (cid:28) k. The set of all score functions is thus obtained by varying W\nin this de\ufb01nition and is denoted by FF,H. As a concrete example of a score family FF,H for structured\nprediction, consider the standard sequence model with unary and pairwise potentials. In this case, the\ndimension r equals T s + (T \u2212 1)s2, where T is the sequence length and s is the number of labels\nof each variable. 
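The structure of F for this sequence model can be made explicit for tiny T and s (our own illustration): each joint labeling (one of k = s^T rows) gets one-hot indicator features for every unary and every pairwise potential, giving r = T s + (T - 1)s^2 columns.

```python
import itertools
import numpy as np

def chain_feature_matrix(T, s):
    """Indicator feature matrix F of a chain model, r = T*s + (T-1)*s**2.

    Row y (a joint labeling of T variables with s states each) has a single 1
    in each unary column group and in each pairwise column group.
    """
    r = T * s + (T - 1) * s * s
    labelings = list(itertools.product(range(s), repeat=T))
    F = np.zeros((len(labelings), r))
    for row, y in enumerate(labelings):
        for t in range(T):                       # unary groups
            F[row, t * s + y[t]] = 1.0
        for t in range(T - 1):                   # pairwise groups
            off = T * s + t * s * s
            F[row, off + y[t] * s + y[t + 1]] = 1.0
    return F

F = chain_feature_matrix(T=3, s=2)   # k = 8 labelings, r = 14 columns
# every row has exactly one 1 in each of the 2*T - 1 column groups
```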
The columns of the matrix F consist of 2T \u2212 1 groups (one for each unary and pairwise potential). Each row of F has exactly one entry equal to one in each column group (with zeros elsewhere).\nIn this setting, we use the online projected averaged stochastic subgradient descent ASGD4 (stochastic w.r.t. data (x(n), y(n)) \u223c D) to minimize the surrogate risk directly [6]. The n-th update consists in\n\nW (n) := PD[W (n\u22121) \u2212 \u03b3(n)F T\u2207\u03a6\u03c8(x(n))T],    (9)\n\nwhere F T\u2207\u03a6\u03c8(x(n))T : H \u2192 Rr is the stochastic functional gradient, \u03b3(n) is the step size and PD is the projection on the ball of radius D w.r.t. the Hilbert\u2013Schmidt norm5. The vector \u2207\u03a6 \u2208 Rk is the regular gradient of the sampled surrogate \u03a6(f(x(n)), y(n)) w.r.t. the scores, \u2207\u03a6 = \u2207f \u03a6(f , y(n))|f =f(x(n)). We wrote the above update using an explicit feature map \u03c8 for notational simplicity, but kernel ASGD can also be implemented without it by using the kernel trick. The convergence properties of ASGD in an RKHS are analogous to those of finite-dimensional ASGD because they rely on dimension-free quantities. To use a simple convergence analysis, we follow Ciliberto et al. [10] and make the following simplifying assumption:\nAssumption 4 (Well-specified optimization w.r.t. the function class FF,H). The distribution D is such that R\u2217\u03a6,F := inf f\u2208FF R\u03a6(f) has some global minimum f\u2217 that also belongs to FF,H.\nAssumption 4 simply means that each row of W \u2217 defining f\u2217 belongs to the RKHS H, implying a finite norm \u2016W \u2217\u2016HS. Assumption 4 can be relaxed if the kernel K is universal, but then the convergence analysis becomes much more complicated [36].\n\n3Note that if rank(F ) = r, our setup is equivalent to assuming a joint kernel [47] in the product form: Kjoint((x, c), (x\u2032, c\u2032)) := K(x, x\u2032)F (c, :)F (c\u2032, :)T, where F (c, :) is the row c of matrix F .\n4See, e.g., [36] for the formal setup of kernel ASGD.\n\nTheorem 5 (Convergence rate). Under Assumption 4 and assuming that (i) the functions \u03a6(f , y) are bounded from below and convex w.r.t. f \u2208 Rk for all y \u2208 Y; (ii) the expected square of the norm of the stochastic gradient is bounded, IE(x,y)\u223cD \u2016F T\u2207\u03a6\u03c8(x)T\u2016^2 HS \u2264 M^2; and (iii) \u2016W \u2217\u2016HS \u2264 D, then running the ASGD algorithm (9) with the constant step-size \u03b3 := 2D/(M\u221aN ) for N steps admits the following expected suboptimality for the averaged iterate \u00aff(N )(x) := (1/N ) \u2211N n=1 F W (n)\u03c8(x):\n\nIE[R\u03a6(\u00aff(N ))] \u2212 R\u2217\u03a6,F \u2264 2DM/\u221aN .    (10)\n\nTheorem 5 is a straightforward extension of classical results [33, 36].\nBy combining the convergence rate of Theorem 5 with Theorem 2 that connects the surrogate and actual risks, we get Theorem 6, which explicitly gives the number of iterations required to achieve \u03b5 accuracy on the expected population risk (see App. A for the proof). Note that since ASGD is applied in an online fashion, Theorem 6 also serves as the sample complexity bound, i.e., says how many samples are needed to achieve \u03b5 target accuracy (compared to the best prediction rule if F has zero approximation error).\nTheorem 6 (Learning complexity). Under the assumptions of Theorem 5, for any \u03b5 > 0, the random (w.r.t. the observed training set) output \u00aff(N ) \u2208 FF,H of the ASGD algorithm after\n\nN > N\u2217 := 4D^2M^2 / \u02c7H^2 \u03a6,L,F (\u03b5)    (11)\n\niterations has the expected excess risk bounded with \u03b5, i.e., IE[RL(\u00aff(N ))] < R\u2217L,F + \u03b5.\n\n4 Calibration function analysis for quadratic surrogate\n\nA major challenge to applying Theorem 6 is the computation of the calibration function H\u03a6,L,F . In App. C, we present a generalization to arbitrary multi-class losses of a surrogate loss class from Zhang [49, Section 4.4.2] that is consistent for any task loss L. Here, we consider the simplest example of this family, called the quadratic surrogate \u03a6quad, which has the advantage that we can bound or even compute exactly its calibration function. We define the quadratic surrogate as\n\n\u03a6quad(f , y) := (1/(2k)) \u2016f + L(:, y)\u2016^2 = (1/(2k)) \u2211k c=1 (fc^2 + 2fcL(c, y) + L(c, y)^2).    (12)\n\nOne simple sufficient condition for the surrogate (12) to be consistent and also to have zero approximation error is that F fully contains span(L). To make the dependence on the score subspace explicit, we parameterize it with a matrix F \u2208 Rk\u00d7r with the number of columns r typically being much smaller than the number of labels k. With this notation, we have F = span(F ) = {F \u03b8 | \u03b8 \u2208 Rr}, and the dimensionality of F equals the rank of F , which is at most r.6\n5The Hilbert\u2013Schmidt norm of a linear operator A is defined as \u2016A\u2016HS = \u221a(tr A\u2021A), where A\u2021 is the adjoint operator. 
In the case of finite dimension, the Hilbert\u2013Schmidt norm coincides with the Frobenius matrix norm.\n6Evaluating \u03a6quad requires computing F TF and F TL(:, y), for which direct computation is intractable when k is exponential, but which can be done in closed form for the structured losses we consider (the Hamming and block 0-1 loss). More generally, these operations require suitable inference algorithms. See also App. F.\n\nFor the quadratic surrogate (12), the excess of the expected surrogate takes a simple form:\n\n\u03b4\u03c6quad(F \u03b8, q) = (1/(2k)) \u2016F \u03b8 + Lq\u2016^2.    (13)\n\nEquation (13) holds under the assumption that the subspace F contains the column space of the loss matrix span(L), which also means that the set F contains the optimal prediction for any q (see Lemma 9 in App. B for the proof). Importantly, the function \u03b4\u03c6quad(F \u03b8, q) is jointly convex in the conditional probability q and the parameters \u03b8, which simplifies its analysis.\nLower bound on the calibration function. We now present our main technical result: a lower bound on the calibration function for the surrogate loss \u03a6quad (12). This lower bound characterizes the easiness of learning with this surrogate given the scaling intuition mentioned in Section 3.3. The proof of Theorem 7 is given in App. D.1.\nTheorem 7 (Lower bound on H\u03a6quad). For any task loss L, its quadratic surrogate \u03a6quad, and a score subspace F containing the column space of L, the calibration function can be lower bounded:\n\nH\u03a6quad,L,F (\u03b5) \u2265 \u03b5^2 / (2k max i\u2260j \u2016PF \u2206ij\u2016^2) \u2265 \u03b5^2/(4k),    (14)\n\nwhere PF is the orthogonal projection on the subspace F and \u2206ij = ei \u2212 ej \u2208 Rk with ec being the c-th basis vector of the standard basis in Rk.\nLower bound for specific losses. 
We now discuss the meaning of the bound (14) for some specific losses (the detailed derivations are given in App. D.3). For the 0-1, block 0-1 and Hamming losses (L01, L01,b and LHam,T , respectively) with the smallest possible score subspaces F, the bound (14) gives \u03b5^2/(4k), \u03b5^2/(4b) and \u03b5^2/(8T ), respectively. All these bounds are tight (see App. E). However, if F = Rk, the bound (14) is not tight for the block 0-1 and mixed losses (see also App. E). In particular, the bound (14) cannot detect level-\u03b7 consistency for \u03b7 > 0 (see Def. 3) and does not change when the loss changes but the score subspace stays the same.\nUpper bound on the calibration function. Theorem 8 below gives an upper bound on the calibration function holding for unconstrained scores, i.e., F = Rk (see the proof in App. D.2). This result shows that without some appropriate constraints on the scores, efficient learning is not guaranteed (in the worst case) because of the 1/k scaling of the calibration function.\nTheorem 8 (Upper bound on H\u03a6quad). If a loss matrix L with Lmax > 0 defines a pseudometric7 on labels and there are no constraints on the scores, i.e., F = Rk, then the calibration function for the quadratic surrogate \u03a6quad can be upper bounded: H\u03a6quad,L,Rk (\u03b5) \u2264 \u03b5^2/(2k), 0 \u2264 \u03b5 \u2264 Lmax.\nFrom our lower bound in Theorem 7 (which guarantees consistency), the natural constraint on the score is F = span(L), with the dimension of this space giving an indication of the intrinsic \u201cdifficulty\u201d of a loss. Computations for the lower bounds in some specific cases (see App. D.3 for details) show that the 0-1 loss is \u201chard\u201d while the block 0-1 loss and the Hamming loss are \u201ceasy\u201d. Note that in all these cases the lower bound (14) is tight; see the discussion below.\nExact calibration functions. 
Note that the bounds proven in Theorems 7 and 8 imply that, in the case of no constraints on the scores, F = ℝ^k, for the 0-1, block 0-1 and Hamming losses, we have

$$\frac{\varepsilon^2}{4k} \;\le\; H_{\Phi_{\mathrm{quad}},L,\mathbb{R}^k}(\varepsilon) \;\le\; \frac{\varepsilon^2}{2k}, \qquad 0 \le \varepsilon \le L_{\max}, \qquad (15)$$

where L is the matrix defining a loss. For completeness, in App. E, we compute the exact calibration functions for the 0-1 and block 0-1 losses. Note that the calibration function for the 0-1 loss equals the lower bound, illustrating the worst-case scenario. To get some intuition, an example of a conditional distribution q that gives the (worst-case) value to the calibration function (for several losses) is q_i = 1/2 + ε/2, q_j = 1/2 − ε/2 and q_c = 0 for c ∉ {i, j}. See the proof of Proposition 12 in App. E.1.

In what follows, we provide the calibration functions in the cases with constraints on the scores. For the block 0-1 loss with b equal blocks and under the constraint that the scores within blocks are equal, the calibration function equals (see Proposition 14 of App. E.2)

$$H_{\Phi_{\mathrm{quad}},L_{01,b},F_{01,b}}(\varepsilon) = \frac{\varepsilon^2}{4b}, \qquad 0 \le \varepsilon \le 1. \qquad (16)$$

⁷ A pseudometric is a function d(a, b) satisfying the following axioms: d(x, y) ≥ 0; d(x, x) = 0 (but possibly d(x, y) = 0 for some x ≠ y); d(x, y) = d(y, x); d(x, z) ≤ d(x, y) + d(y, z).

For the Hamming loss defined over T binary variables and under constraints implying separable scores, the calibration function equals (see Proposition 15 in App. E.3)

$$H_{\Phi_{\mathrm{quad}},L_{\mathrm{Ham},T},F_{\mathrm{Ham},T}}(\varepsilon) = \frac{\varepsilon^2}{8T}, \qquad 0 \le \varepsilon \le 1. \qquad (17)$$

The calibration functions (16) and (17) depend on quantities representing the actual complexity of the loss (the number of blocks b and the sequence length T) and can be exponentially larger than the upper bound for the unconstrained case.

In the case of the mixed 0-1 and block 0-1 loss, if the scores f are constrained to be equal inside the blocks, i.e., belong to the subspace F01,b = span(L01,b) ⊊ ℝ^k, then the calibration function is equal to 0 for ε ≤ η/2, implying inconsistency (note also that the approximation error can be as big as η for F01,b). However, for ε > η/2, the calibration function is of the order (1/b)(ε − η/2)². See Figure 1b for an illustration of this calibration function and Proposition 17 of App. E.4 for the exact formulation and the proof. Note that while the constrained case is inconsistent, the value of its calibration function can be exponentially larger than the one for the unconstrained case when ε is big enough and the blocks are exponentially large (see Proposition 16 of App. E.4).

Computation of the SGD constants. Applying the learning complexity Theorem 6 requires computing the quantity DM, where D bounds the norm of the optimal solution and M bounds the expected square of the norm of the stochastic gradient. In App. F, we provide a way to bound this quantity for our quadratic surrogate (12) under the simplifying assumption that each conditional q_c(x) (seen as a function of x) belongs to the RKHS H (which implies Assumption 4). In particular, we get

$$DM = L_{\max}^2\,\xi\!\left(\kappa(F)\sqrt{r}\,R\,Q_{\max}\right), \qquad \xi(z) = z^2 + z, \qquad (18)$$

where κ(F) is the condition number of the matrix F, R is an upper bound on the RKHS norm of the object feature maps ‖ψ(x)‖_H, and Qmax is an upper bound on $\sum_{c=1}^k \|q_c\|_{\mathcal{H}}$ (which can be seen as a generalization of the inequality $\sum_{c=1}^k q_c \le 1$ for probabilities). The constants R and Qmax depend on the data, the constant Lmax depends on the loss, and r and κ(F) depend on the choice of the matrix F.

We compute the constant DM for the specific losses considered in App. F.1. For the 0-1, block 0-1 and Hamming losses, we have DM = O(k), DM = O(b) and DM = O(log₂³ k), respectively. These computations indicate that the quadratic surrogate allows efficient learning for the structured block 0-1 and Hamming losses, but that the convergence could be slow in the worst case for the 0-1 loss.

5 Related works

Consistency for multi-class problems. Building on significant progress for the case of binary classification (see, e.g., [5]), there has been a lot of interest in the multi-class case. Zhang [49] and Tewari & Bartlett [46] analyze the consistency of many existing surrogates for the 0-1 loss. Gao & Zhou [20] focus on multi-label classification. Narasimhan et al. [32] provide a consistent algorithm for an arbitrary multi-class loss defined by a function of the confusion matrix. Recently, Ramaswamy & Agarwal [39] introduce the notion of the convex calibration dimension, i.e., the minimal dimensionality of the score vector that is required for consistency. In particular, they showed that for the Hamming loss on T binary variables, this dimension is at most T. In our analysis, we use scores of rank (T + 1), see (35) in App. D.3, yielding a similar result.

The task of ranking has attracted a lot of attention, and [18, 8, 9, 40] analyze different families of surrogate and task losses, proving their (in-)consistency.
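The low-rank structure mentioned above is easy to check numerically on small instances. The following sketch is ours, not part of the paper; the loss-matrix constructions follow the standard definitions of the block 0-1 loss with b equal blocks and the normalized Hamming loss over T binary variables (so k = 2^T), and it confirms rank(L01,b) = b and rank(LHam,T) = T + 1.

```python
import numpy as np
from itertools import product

def block_01_loss(k, b):
    """Block 0-1 loss on k labels split into b equal blocks:
    zero cost within a block, unit cost across blocks."""
    assert k % b == 0
    block = np.repeat(np.arange(b), k // b)  # block index of each label
    return (block[:, None] != block[None, :]).astype(float)

def hamming_loss(T):
    """Normalized Hamming loss over T binary variables (k = 2**T labels)."""
    labels = np.array(list(product([0, 1], repeat=T)))  # all binary strings
    return (labels[:, None, :] != labels[None, :, :]).mean(axis=2)

L_block = block_01_loss(k=12, b=3)
L_ham = hamming_loss(T=4)  # k = 16

print(np.linalg.matrix_rank(L_block))  # 3, i.e., b
print(np.linalg.matrix_rank(L_ham))    # 5, i.e., T + 1
```

For the Hamming loss, the rank T + 1 matches the score dimension mentioned above, and for the block 0-1 loss the rank b can be exponentially smaller than k, in line with the DM scalings.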
In this line of work, Ramaswamy et al. [40] propose a quadratic surrogate for an arbitrary low-rank loss which is related to our quadratic surrogate (12). They also prove that several important ranking losses, i.e., precision@q, expected rank utility, mean average precision and pairwise disagreement, are of low rank. We conjecture that our approach is compatible with these losses and leave precise connections as future work.

Structured SVM (SSVM) and friends. SSVM [44, 45, 47] is one of the most used convex surrogates for tasks with structured outputs; thus, its consistency has been a question of great interest. It is known that the Crammer-Singer multi-class SVM [15], which SSVM is built on, is not consistent for the 0-1 loss unless there is a majority class with probability at least 1/2 [49, 31]. However, it is consistent for the "abstain" and ordinal losses in the case of 3 classes [39]. Structured ramp loss and probit surrogates are closely related to SSVM and are consistent [31, 16, 30, 23], but not convex.

Recently, Doğan et al. [17] categorized different versions of the multi-class SVM and analyzed them from the Fisher and universal consistency points of view. In particular, they highlight differences between Fisher and universal consistency and give examples of surrogates that are Fisher consistent but not universally consistent, and vice versa. They also highlight that the Crammer-Singer SVM is neither Fisher nor universally consistent, even with a careful choice of regularizer.

Quadratic surrogates for structured prediction. Ciliberto et al. [10] and Brouard et al. [7] consider minimizing $\sum_{i=1}^n \|g(x_i) - \psi_o(y_i)\|_{\mathcal{H}}^2$, aiming to match the RKHS embedding of inputs g : X → H to the feature maps of outputs ψo : Y → H. In their frameworks, the task loss is not considered at the learning stage, but only at the prediction stage. Our quadratic surrogate (12) depends on the loss directly.
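To make this difference concrete, here is a minimal numeric sketch (ours, not from the paper). It takes the surrogate, up to a term constant in the scores, as Φquad(f, y) = (‖f‖₂² + 2 f⊤L(:, y))/(2k) — an assumption consistent with footnote 6 — and checks the excess identity (13) for unconstrained scores on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 6  # a tiny label set; in structured prediction k would be exponential

# A hypothetical symmetric task-loss matrix with zero diagonal, and a random
# conditional label distribution q.
L = rng.random((k, k))
L = (L + L.T) / 2
np.fill_diagonal(L, 0.0)
q = rng.random(k)
q /= q.sum()

def expected_surrogate(f):
    """E_{y~q} Phi_quad(f, y), with Phi_quad(f, y) taken (up to a term
    constant in f) as (||f||^2 + 2 f.L(:, y)) / (2k)."""
    return (f @ f + 2 * f @ (L @ q)) / (2 * k)

def surrogate_excess(f):
    """delta-phi(f, q): excess over the unconstrained minimizer f* = -Lq."""
    return expected_surrogate(f) - expected_surrogate(-L @ q)

f = rng.standard_normal(k)
lhs = surrogate_excess(f)
rhs = np.linalg.norm(f + L @ q) ** 2 / (2 * k)  # closed form from Eq. (13)
print(np.isclose(lhs, rhs))  # True
```

Note that the loss matrix L enters the surrogate (and its minimizer −Lq) directly, in contrast with the embedding-matching objectives above.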
The empirical risk defined by both their and our objectives can be minimized analytically with the help of the kernel trick and, moreover, the resulting predictors are identical. However, performing such a computation can be intractable for large datasets, and the generalization properties have to be taken care of, e.g., by means of regularization. In the large-scale scenario, it is more natural to apply stochastic optimization (e.g., kernel ASGD), which directly minimizes the population risk and has a better dependency on the dataset size. When combined with stochastic optimization, the two approaches lead to different behavior: in our framework, we need to estimate r = rank(L) scalar functions, whereas the alternative needs to estimate k functions (if, e.g., ψo(y) = e_y ∈ ℝ^k), which results in significant differences for low-rank losses such as the block 0-1 and Hamming losses.

Calibration functions. Bartlett et al. [5] and Steinwart [43] provide calibration functions for most existing surrogates for binary classification. All these functions differ in terms of shape, but are roughly similar in terms of constants. Pedregosa et al. [37] generalize these results to the case of ordinal regression. However, their calibration functions have at best a 1/k factor if the surrogate is normalized w.r.t. the number of classes. The task of ranking has also been of significant interest. However, most of the literature [e.g., 11, 14, 24, 1] only focuses on calibration functions (in the form of regret bounds) for bipartite ranking, which is more akin to cost-sensitive binary classification.

Ávila Pires et al. [3] generalize the theoretical framework developed by Steinwart [43] and present results for the multi-class SVM of Lee et al.
[26] (the score vectors are constrained to sum to zero) that can be built for any task loss of interest. Their surrogate Φ is of the form $\sum_{c \in \mathcal{Y}} L(c, y)\, a(f_c)$, where $\sum_{c \in \mathcal{Y}} f_c = 0$ and a(f) is some convex function with all subgradients at zero being positive. The recent work by Ávila Pires & Szepesvári [2] refines these results, but specifically for the case of the 0-1 loss. In this line of work, the surrogate is typically not normalized by k, and if it is normalized, the calibration functions have the constant 1/k appearing.

Finally, Ciliberto et al. [10] provide the calibration function for their quadratic surrogate. They assume that the loss can be represented as $L(\hat{y}, y) = \langle V \psi_o(\hat{y}), \psi_o(y)\rangle_{\mathcal{H}_\mathcal{Y}}$, $\hat{y}, y \in \mathcal{Y}$ (this assumption can always be satisfied in the case of a finite number of labels, by taking V as the loss matrix L and ψo(y) := e_y ∈ ℝ^k, where e_y is the y-th vector of the standard basis in ℝ^k). In their Theorem 2, they provide an excess risk bound leading to a lower bound on the corresponding calibration function, $H_{\Phi,L,\mathbb{R}^k}(\varepsilon) \ge \varepsilon^2 / c_\Delta^2$, where the constant $c_\Delta = \|V\|_2 \max_{y \in \mathcal{Y}} \|\psi_o(y)\|$ simply equals the spectral norm of the loss matrix for the finite-dimensional construction provided above. However, the spectral norm of the loss matrix is exponentially large even for highly structured losses such as the block 0-1 and Hamming losses, i.e., $\|L_{01,b}\|_2 = k - \tfrac{k}{b}$ and $\|L_{\mathrm{Ham},T}\|_2 = \tfrac{k}{2}$. This conclusion puts the objective of Ciliberto et al. [10] in line with ours when no constraints are put on the scores.

6 Conclusion

In this paper, we studied the consistency of convex surrogate losses specifically in the context of structured prediction.
We analyzed calibration functions and proposed an optimization-based normalization aiming to connect consistency with the existence of efficient learning algorithms. Finally, we instantiated all components of our framework for several losses by computing the calibration functions and the constants coming from the normalization. By carefully monitoring exponential constants, we highlighted the difference between tractable and intractable task losses. These were first steps in advancing our theoretical understanding of consistent structured prediction. Further steps include analyzing more losses, such as the low-rank ranking losses studied by Ramaswamy et al. [40]; in addition, instead of considering constraints on the scores, one could put constraints on the set of distributions to investigate the effect on the calibration function.

Acknowledgements

We would like to thank Pascal Germain for useful discussions. This work was partly supported by the ERC grant Activia (no. 307574), the NSERC Discovery Grant RGPIN-2017-06936 and the MSR-INRIA Joint Center.

References

[1] Agarwal, Shivani. Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research (JMLR), 15(1):1653–1674, 2014.

[2] Ávila Pires, Bernardo and Szepesvári, Csaba. Multiclass classification calibration functions. arXiv, 1609.06385v1, 2016.

[3] Ávila Pires, Bernardo, Ghavamzadeh, Mohammad, and Szepesvári, Csaba. Cost-sensitive multiclass classification risk bounds. In ICML, 2013.

[4] Bakir, Gökhan, Hofmann, Thomas, Schölkopf, Bernhard, Smola, Alexander J., Taskar, Ben, and Vishwanathan, S.V.N. Predicting Structured Data. MIT Press, 2007.

[5] Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138–156, 2006.

[6] Bousquet, Olivier and Bottou, Léon. The tradeoffs of large scale learning. In NIPS, 2008.

[7] Brouard, Céline, Szafranski, Marie, and d'Alché-Buc, Florence. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research (JMLR), 17(176):1–48, 2016.

[8] Buffoni, David, Gallinari, Patrick, Usunier, Nicolas, and Calauzènes, Clément. Learning scoring functions with order-preserving losses and standardized supervision. In ICML, 2011.

[9] Calauzènes, Clément, Usunier, Nicolas, and Gallinari, Patrick. On the (non-)existence of convex, calibrated surrogate losses for ranking. In NIPS, 2012.

[10] Ciliberto, Carlo, Rudi, Alessandro, and Rosasco, Lorenzo. A consistent regularization approach for structured prediction. In NIPS, 2016.

[11] Clémençon, Stéphan, Lugosi, Gábor, and Vayatis, Nicolas. Ranking and empirical minimization of U-statistics. The Annals of Statistics, pp. 844–874, 2008.

[12] Collins, Michael. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.

[13] Cortes, Corinna, Kuznetsov, Vitaly, Mohri, Mehryar, and Yang, Scott. Structured prediction theory based on factor graph complexity. In NIPS, 2016.

[14] Cossock, David and Zhang, Tong. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.

[15] Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.

[16] Do, Chuong B., Le, Quoc, Teo, Choon Hui, Chapelle, Olivier, and Smola, Alex. Tighter bounds for structured estimation.
In NIPS, 2009.

[17] Doğan, Ürün, Glasmachers, Tobias, and Igel, Christian. A unified view on multi-class support vector classification. Journal of Machine Learning Research (JMLR), 17(45):1–32, 2016.

[18] Duchi, John C., Mackey, Lester W., and Jordan, Michael I. On the consistency of ranking algorithms. In ICML, 2010.

[19] Durbin, Richard, Eddy, Sean, Krogh, Anders, and Mitchison, Graeme. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

[20] Gao, Wei and Zhou, Zhi-Hua. On the consistency of multi-label learning. In COLT, 2011.

[21] Gimpel, Kevin and Smith, Noah A. Softmax-margin CRFs: Training loglinear models with cost functions. In NAACL, 2010.

[22] Hazan, Tamir and Urtasun, Raquel. A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, 2010.

[23] Keshet, Joseph. Optimizing the measure of performance in structured prediction. In Advanced Structured Prediction. MIT Press, 2014.

[24] Kotlowski, Wojciech, Dembczynski, Krzysztof, and Huellermeier, Eyke. Bipartite ranking through minimization of univariate loss. In ICML, 2011.

[25] Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

[26] Lee, Yoonkyung, Lin, Yi, and Wahba, Grace. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.

[27] Lin, Yi. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.

[28] London, Ben, Huang, Bert, and Getoor, Lise.
Stability and generalization in structured prediction.\n\nJournal of Machine Learning Research (JMLR), 17(222):1\u201352, 2016.\n\n[29] Long, Phil and Servedio, Rocco. Consistency versus realizable H-consistency for multiclass\n\nclassi\ufb01cation. In ICML, 2013.\n\n[30] McAllester, D. A. and Keshet, J. Generalization bounds and consistency for latent structural\n\nprobit and ramp loss. In NIPS, 2011.\n\n[31] McAllester, David. Generalization bounds and consistency for structured labeling. In Predicting\n\nStructured Data. MIT Press, 2007.\n\n[32] Narasimhan, Harikrishna, Ramaswamy, Harish G., Saha, Aadirupa, and Agarwal, Shivani.\n\nConsistent multiclass algorithms for complex performance measures. In ICML, 2015.\n\n[33] Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation\napproach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n\n[34] Nowozin, Sebastian and Lampert, Christoph H. Structured learning and prediction in computer\n\nvision. Foundations and Trends in Computer Graphics and Vision, 6(3\u20134):185\u2013365, 2011.\n\n[35] Nowozin, Sebastian, Gehler, Peter V., Jancsary, Jeremy, and Lampert, Christoph H. Advanced\n\nStructured Prediction. MIT Press, 2014.\n\n[36] Orabona, Francesco. Simultaneous model selection and optimization through parameter-free\n\nstochastic learning. In NIPS, 2014.\n\n[37] Pedregosa, Fabian, Bach, Francis, and Gramfort, Alexandre. On the consistency of ordinal\n\nregression methods. Journal of Machine Learning Research (JMLR), 18(55):1\u201335, 2017.\n\n[38] Pletscher, Patrick, Ong, Cheng Soon, and Buhmann, Joachim M. Entropy and margin maxi-\n\nmization for structured output learning. In ECML PKDD, 2010.\n\n[39] Ramaswamy, Harish G. and Agarwal, Shivani. Convex calibration dimension for multiclass\n\nloss matrices. Journal of Machine Learning Research (JMLR), 17(14):1\u201345, 2016.\n\n[40] Ramaswamy, Harish G., Agarwal, Shivani, and Tewari, Ambuj. 
Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. In NIPS, 2013.

[41] Shi, Qinfeng, Reid, Mark, Caetano, Tiberio, van den Hengel, Anton, and Wang, Zhenhua. A hybrid loss for multiclass and structured prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(1):2–12, 2015.

[42] Smith, Noah A. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274, 2011.

[43] Steinwart, Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

[44] Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2003.

[45] Taskar, Ben, Chatalbashev, Vassil, Koller, Daphne, and Guestrin, Carlos. Learning structured prediction models: a large margin approach. In ICML, 2005.

[46] Tewari, Ambuj and Bartlett, Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research (JMLR), 8:1007–1025, 2007.

[47] Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, 2005.

[48] Williamson, Robert C., Vernet, Elodie, and Reid, Mark D. Composite multiclass losses. Journal of Machine Learning Research (JMLR), 17(223):1–52, 2016.

[49] Zhang, Tong. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research (JMLR), 5:1225–1251, 2004.

[50] Zhang, Tong. Statistical behavior and consistency of classification methods based on convex risk minimization.
Annals of Statistics, 32(1):56–134, 2004.