{"title": "Direct Loss Minimization for Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 1594, "page_last": 1602, "abstract": "In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpass all previously reported results on this problem.", "full_text": "Direct Loss Minimization for Structured Prediction\n\nDavid McAllester\nTTI-Chicago\nmcallester@ttic.edu\n\nTamir Hazan\nTTI-Chicago\ntamir@ttic.edu\n\nJoseph Keshet\nTTI-Chicago\njkeshet@ttic.edu\n\nAbstract\n\nIn discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. 
The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpass all previously reported results on this problem.\n\n1 Introduction\n\nMany modern software systems compute a result as the solution, or approximate solution, to an optimization problem. For example, modern machine translation systems convert an input word string into an output word string in a different language by approximately optimizing a score defined on the input-output pair. Optimization underlies the leading approaches in a wide variety of computational problems including problems in computational linguistics, computer vision, genome annotation, advertisement placement, and speech recognition. In many optimization-based software systems one must design the objective function as well as the optimization algorithm. Here we consider a parameterized objective function and the problem of setting the parameters of the objective in such a way that the resulting optimization-driven software system performs well.\n\nWe can formulate an abstract problem by letting X be an abstract set of possible inputs and Y an abstract set of possible outputs. We assume an objective function s_w : X \u00d7 Y \u2192 R parameterized by a vector w \u2208 R^d such that for x \u2208 X and y \u2208 Y we have a score s_w(x, y). 
The parameter setting w determines a mapping from input x to output y_w(x), defined as follows.\n\ny_w(x) = argmax_{y \u2208 Y} s_w(x, y)    (1)\n\nOur goal is to set the parameters w of the scoring function such that the mapping from input to output defined by (1) performs well. More formally, we assume that there exists some unknown probability distribution \u03c1 over pairs (x, y) where y is the desired output (or reference output) for input x. We assume a loss function L, such as the BLEU score, which gives a cost L(y, \u0177) \u2265 0 for producing output \u0177 when the desired output (reference output) is y. We then want to set w so as to minimize the expected loss.\n\nw\u2217 = argmin_w E[L(y, y_w(x))]    (2)\n\nIn (2) the expectation is taken over a random draw of the pair (x, y) from the source data distribution \u03c1. Throughout this paper all expectations will be over a random draw of a fresh pair (x, y). In machine learning terminology we refer to (1) as inference and (2) as training.\n\nUnfortunately the training objective function (2) is typically non-convex and we are not aware of any polynomial algorithms (in time and sample complexity) with reasonable approximation guarantees to (2) for typical loss functions, say 0-1 loss, and an arbitrary distribution \u03c1. In spite of the lack of approximation guarantees, it is common to replace the objective in (2) with a convex relaxation such as structural hinge loss [8, 10]. 
It should be noted that replacing the objective in (2) with structural hinge loss leads to inconsistency: the optimum of the relaxation is different from the optimum of (2).\n\nAn alternative to a convex relaxation is to perform gradient descent directly on the objective in (2). In some applications it seems possible that the local minima problem of non-convex optimization is less serious than the inconsistencies introduced by a convex relaxation.\n\nUnfortunately, direct gradient descent on (2) is conceptually puzzling in the case where the output space Y is discrete. In this case the output y_w(x) is not a differentiable function of w. As one smoothly changes w the output y_w(x) jumps discontinuously between discrete output values. So one cannot write \u2207_w E[L(y, y_w(x))] as E[\u2207_w L(y, y_w(x))]. However, when the input space X is continuous the gradient \u2207_w E[L(y, y_w(x))] can exist even when the output space Y is discrete. The main result of this paper is a perceptron-like method of performing direct gradient descent on (2) in the case where the output space is discrete but the input space is continuous.\n\nAfter formulating our method we discovered that closely related methods have recently become popular for training machine translation systems [7, 2]. Although machine translation has discrete inputs as well as discrete outputs, the training method we propose can still be used, although without theoretical guarantees. We also present empirical results on the use of this method in phoneme alignment on the TIMIT corpus, where it achieves the best known results on this problem.\n\n2 Perceptron-Like Training Methods\n\nPerceptron-like training methods are generally formulated for the case where the scoring function is linear in w. 
In other words, we assume that the scoring function can be written as follows where \u03c6 : X \u00d7 Y \u2192 R^d is called a feature map.\n\ns_w(x, y) = w\u22a4\u03c6(x, y)\n\nBecause the feature map \u03c6 can itself be nonlinear, and the feature vector \u03c6(x, y) can be very high dimensional, objective functions of this form are highly expressive.\n\nHere we will formulate perceptron-like training in the data-rich regime where we have access to an unbounded sequence (x_1, y_1), (x_2, y_2), (x_3, y_3), . . . where each (x_t, y_t) is drawn IID from the distribution \u03c1. In the basic structured prediction perceptron algorithm [3] one constructs a sequence of parameter settings w_0, w_1, w_2, . . . where w_0 = 0 and w_{t+1} is defined as follows.\n\nw_{t+1} = w_t + \u03c6(x_t, y_t) \u2212 \u03c6(x_t, y_{w_t}(x_t))    (3)\n\nNote that if y_{w_t}(x_t) = y_t then no update is made and we have w_{t+1} = w_t. If y_{w_t}(x_t) \u2260 y_t then the update changes the parameter vector in a way that favors y_t over y_{w_t}(x_t). If the source distribution \u03c1 is \u03b3-separable, i.e., there exists a weight vector w with the property that y_w(x) = y with probability 1 and y_w(x) is always \u03b3-separated from all distractors, then the perceptron update rule will eventually lead to a parameter setting with zero loss. Note, however, that the basic perceptron update does not involve the loss function L. Hence it cannot be expected to optimize the training objective (2) in cases where zero loss is unachievable.\n\nA loss-sensitive perceptron-like algorithm can be derived from the structured hinge loss of a margin-scaled structural SVM [10]. 
The optimization problem for margin-scaled structured hinge loss can be defined as follows.\n\nw\u2217 = argmin_w E[max_{\u1ef9 \u2208 Y} (L(y, \u1ef9) \u2212 w\u22a4(\u03c6(x, y) \u2212 \u03c6(x, \u1ef9)))]\n\nIt can be shown that this is a convex relaxation of (2). We can optimize this convex relaxation with stochastic sub-gradient descent. To do this we compute a sub-gradient of the objective by first computing the value of \u1ef9 which achieves the maximum.\n\ny^t_hinge = argmax_{\u1ef9 \u2208 Y} L(y_t, \u1ef9) \u2212 (w_t)\u22a4(\u03c6(x_t, y_t) \u2212 \u03c6(x_t, \u1ef9))\n= argmax_{\u1ef9 \u2208 Y} (w_t)\u22a4\u03c6(x_t, \u1ef9) + L(y_t, \u1ef9)    (4)\n\nThis yields the following perceptron-like update rule where the update direction is the negative of the sub-gradient of the loss and \u03b7_t is a learning rate.\n\nw_{t+1} = w_t + \u03b7_t (\u03c6(x_t, y_t) \u2212 \u03c6(x_t, y^t_hinge))    (5)\n\nEquation (4) is often referred to as loss-adjusted inference. The use of loss-adjusted inference causes the update rule (5) to be at least influenced by the loss function.\n\nHere we consider the following perceptron-like update rule where \u03b7_t is a time-varying learning rate and \u03b5_t is a time-varying loss-adjustment weight.\n\nw_{t+1} = w_t + \u03b7_t (\u03c6(x_t, y_{w_t}(x_t)) \u2212 \u03c6(x_t, y^t_direct))    (6)\n\ny^t_direct = argmax_{\u1ef9 \u2208 Y} (w_t)\u22a4\u03c6(x_t, \u1ef9) + \u03b5_t L(y_t, \u1ef9)    (7)\n\nIn the update (6) we view y^t_direct as being worse than y_{w_t}(x_t). The update direction moves away from feature vectors of larger-loss labels. Note that the reference label y_t in (5) has been replaced by the inferred label y_{w_t}(x_t) in (6). The main result of this paper is that under mild conditions the expected update direction of (6) approaches the negative direction of \u2207_w E[L(y, y_w(x))] in the limit as the update weight \u03b5_t goes to zero. 
In practice we use a different version of the update rule which moves toward better labels rather than away from worse labels. The toward-better version is given in Section 5. Our main theorem applies equally to the toward-better and away-from-worse versions of the rule.\n\n3 The Loss Gradient Theorem\n\nThe main result of this paper is the following theorem.\n\nTheorem 1. For a finite set Y of possible output values, and for w in general position as defined below, we have the following where y_direct is a function of w, x, y and \u03b5.\n\n\u2207_w E[L(y, y_w(x))] = lim_{\u03b5\u21920} (1/\u03b5) E[\u03c6(x, y_direct) \u2212 \u03c6(x, y_w(x))]\n\nwhere\n\ny_direct = argmax_{\u1ef9 \u2208 Y} w\u22a4\u03c6(x, \u1ef9) + \u03b5 L(y, \u1ef9)\n\nWe prove this theorem in the case of only two labels where we have y \u2208 {\u22121, 1}. Although the proof extends to the general case in a straightforward manner, we omit the general case to maintain the clarity of the presentation. We assume an input set X and a probability distribution or a measure \u03c1 on X \u00d7 {\u22121, 1} and a loss function L(y, y\u2032) for y, y\u2032 \u2208 {\u22121, 1}. Typically the loss L(y, y\u2032) is zero if y = y\u2032 but the loss of a false positive, namely L(\u22121, 1), may be different from the loss of a false negative, L(1, \u22121).\n\nBy definition the gradient of expected loss satisfies the following condition for any vector \u2206w \u2208 R^d.\n\n(\u2206w)\u22a4\u2207_w E[L(y, y_w(x))] = lim_{\u03b5\u21920} (E[L(y, y_{w+\u03b5\u2206w}(x))] \u2212 E[L(y, y_w(x))]) / \u03b5\n\nUsing this observation, the direct loss theorem is equivalent to the following.\n\nlim_{\u03b5\u21920} E[L(y, y_{w+\u03b5\u2206w}(x)) \u2212 L(y, y_w(x))] / \u03b5 = lim_{\u03b5\u21920} (\u2206w)\u22a4E[\u03c6(x, y_direct) \u2212 \u03c6(x, y_w(x))] / \u03b5    (8)\n\nFor the binary case we define \u2206\u03c6(x) = \u03c6(x, 1) \u2212 \u03c6(x, \u22121). 
Under this convention we have y_w(x) = sign(w\u22a4\u2206\u03c6(x)). We first focus on the left hand side of (8). If the two labels y_{w+\u03b5\u2206w}(x) and y_w(x) are the same then the quantity inside the expectation is zero. We now define the following two sets which correspond to the set of inputs x for which these two labels are different.\n\nS+_\u03b5 = {x : y_w(x) = \u22121, y_{w+\u03b5\u2206w}(x) = 1}\n= {x : w\u22a4\u2206\u03c6(x) < 0, (w + \u03b5\u2206w)\u22a4\u2206\u03c6(x) \u2265 0}\n= {x : w\u22a4\u2206\u03c6(x) \u2208 [\u2212\u03b5(\u2206w)\u22a4\u2206\u03c6(x), 0)}\n\nand\n\nS\u2212_\u03b5 = {x : y_w(x) = 1, y_{w+\u03b5\u2206w}(x) = \u22121}\n= {x : w\u22a4\u2206\u03c6(x) \u2265 0, (w + \u03b5\u2206w)\u22a4\u2206\u03c6(x) < 0}\n= {x : w\u22a4\u2206\u03c6(x) \u2208 [0, \u2212\u03b5(\u2206w)\u22a4\u2206\u03c6(x))}\n\nWe define \u2206L(y) = L(y, 1) \u2212 L(y, \u22121) and then write the left hand side of (8) as follows.\n\nE[L(y, y_{w+\u03b5\u2206w}(x)) \u2212 L(y, y_w(x))] = E[\u2206L(y) \u00b7 1{x \u2208 S+_\u03b5}] \u2212 E[\u2206L(y) \u00b7 1{x \u2208 S\u2212_\u03b5}]    (9)\n\nThese expectations are shown as integrals in Figure 1 (a) where the lines in the figure represent the decision boundaries defined by w and w + \u03b5\u2206w.\n\n[Figure 1. Panel (a): E[\u2206L(y) \u00b7 1{w\u22a4\u2206\u03c6(x) \u2208 [\u2212\u03b5(\u2206w)\u22a4\u2206\u03c6(x), 0)}] \u2212 E[\u2206L(y) \u00b7 1{w\u22a4\u2206\u03c6(x) \u2208 (0, \u2212\u03b5(\u2206w)\u22a4\u2206\u03c6(x)]}]. Panel (b): E[\u2206L(y) \u00b7 (\u2206w)\u22a4\u2206\u03c6(x) \u00b7 1{w\u22a4\u2206\u03c6(x) \u2208 [0, \u03b5]}].]\n\nFigure 1: Geometrical interpretation of the loss gradient. In (a) we illustrate the integration of the constant value \u2206L(y) over the set S+_\u03b5 and the constant value \u2212\u2206L(y) over the set S\u2212_\u03b5 (the green area). The lines represent the decision boundaries defined by the associated vectors. In (b) we show the integration of \u2206L(y)(\u2206w)\u22a4\u2206\u03c6(x) over the sets U+_\u03b5 = {x : w\u22a4\u2206\u03c6(x) \u2208 [0, \u03b5]} and U\u2212_\u03b5 = {x : w\u22a4\u2206\u03c6(x) \u2208 [\u2212\u03b5(\u2206w)\u22a4\u2206\u03c6(x), 0)}. The key observation of the proof is that under very general conditions these integrals are asymptotically equivalent in the limit as \u03b5 goes to zero.\n\nTo analyze this further we use the following lemma.\n\nLemma 1. Let Z(z), U(u) and V(v) be three real-valued random variables whose joint measure \u03c1 can be expressed as a measure \u00b5 on U and V and a bounded continuous conditional density function f(z|u, v). More rigorously, we require that for any \u03c1-measurable set S \u2286 R^3 we have \u03c1(S) = \u222b_{u,v} [\u222b_z f(z|u, v) 1{(z,u,v) \u2208 S} dz] d\u00b5(u, v). For any three such random variables we have the following.\n\nlim_{\u03b5\u2192+0} (1/\u03b5) (E_\u03c1[U \u00b7 1{z \u2208 [0, \u03b5V]}] \u2212 E_\u03c1[U \u00b7 1{z \u2208 [\u03b5V, 0]}]) = E_\u00b5[U V \u00b7 f(0|u, v)]\n= lim_{\u03b5\u2192+0} (1/\u03b5) E_\u03c1[U V \u00b7 1{z \u2208 [0, \u03b5]}]\n\nProof. First we note the following where V+ denotes max(0, V).\n\nlim_{\u03b5\u2192+0} (1/\u03b5) E_\u03c1[U \u00b7 1{z \u2208 [0, \u03b5V)}] = lim_{\u03b5\u2192+0} E_\u00b5[U (1/\u03b5) \u222b_0^{\u03b5V} f(z|u, v) dz]\n= E_\u00b5[U V+ \u00b7 f(0|U, V)]\n\nSimilarly we have the following where V\u2212 denotes min(0, V).\n\nlim_{\u03b5\u2192+0} (1/\u03b5) E_\u03c1[U \u00b7 1{z \u2208 (\u03b5V, 0]}] = lim_{\u03b5\u2192+0} E_\u00b5[U (1/\u03b5) \u222b_{\u03b5V}^0 f(z|u, v) dz]\n= \u2212E_\u00b5[U V\u2212 \u00b7 f(0|U, V)]\n\nSubtracting these two expressions gives the following.\n\nE_\u00b5[U V+ \u00b7 f(0|U, V)] + E_\u00b5[U V\u2212 \u00b7 f(0|U, V)] = E_\u00b5[U (V+ + V\u2212) \u00b7 f(0|U, V)]\n= E_\u00b5[U V \u00b7 f(0|U, V)]\n\nApplying Lemma 1 to (9) with Z being the random variable w\u22a4\u2206\u03c6(x), U being the random variable \u2212\u2206L(y) and V being \u2212(\u2206w)\u22a4\u2206\u03c6(x) yields the following.\n\n(\u2206w)\u22a4\u2207_w E[L(y, y_w(x))] = lim_{\u03b5\u2192+0} (1/\u03b5) (E[\u2206L(y) \u00b7 1{x \u2208 S+_\u03b5}] \u2212 E[\u2206L(y) \u00b7 1{x \u2208 S\u2212_\u03b5}])\n= lim_{\u03b5\u2192+0} (1/\u03b5) E[\u2206L(y) \u00b7 (\u2206w)\u22a4\u2206\u03c6(x) \u00b7 1{w\u22a4\u2206\u03c6(x) \u2208 [0, \u03b5]}]    (10)\n\nOf course we need to check that the 
conditions of Lemma 1 hold. This is where we need a general position assumption for w. We discuss this issue briefly in Section 3.1.\n\nNext we consider the right hand side of (8). If the two labels y_direct and y_w(x) are the same then the quantity inside the expectation is zero. We note that we can write y_direct as follows.\n\ny_direct = sign(w\u22a4\u2206\u03c6(x) + \u03b5\u2206L(y))\n\nWe now define the following two sets which correspond to the set of pairs (x, y) for which y_w(x) and y_direct are different.\n\nB+_\u03b5 = {(x, y) : y_w(x) = \u22121, y_direct = 1}\n= {(x, y) : w\u22a4\u2206\u03c6(x) < 0, w\u22a4\u2206\u03c6(x) + \u03b5\u2206L(y) \u2265 0}\n= {(x, y) : w\u22a4\u2206\u03c6(x) \u2208 [\u2212\u03b5\u2206L(y), 0)}\n\nB\u2212_\u03b5 = {(x, y) : y_w(x) = 1, y_direct = \u22121}\n= {(x, y) : w\u22a4\u2206\u03c6(x) \u2265 0, w\u22a4\u2206\u03c6(x) + \u03b5\u2206L(y) < 0}\n= {(x, y) : w\u22a4\u2206\u03c6(x) \u2208 [0, \u2212\u03b5\u2206L(y))}\n\nWe now have the following.\n\nE[(\u2206w)\u22a4(\u03c6(x, y_direct) \u2212 \u03c6(x, y_w(x)))] = E[(\u2206w)\u22a4\u2206\u03c6(x) \u00b7 1{(x, y) \u2208 B+_\u03b5}] \u2212 E[(\u2206w)\u22a4\u2206\u03c6(x) \u00b7 1{(x, y) \u2208 B\u2212_\u03b5}]    (11)\n\nThese expectations are shown as integrals in Figure 1 (b). 
Applying Lemma 1 to (11) with Z set to w\u22a4\u2206\u03c6(x), U set to \u2212(\u2206w)\u22a4\u2206\u03c6(x) and V set to \u2212\u2206L(y) gives the following.\n\nlim_{\u03b5\u2192+0} (1/\u03b5) (\u2206w)\u22a4E[\u03c6(x, y_direct) \u2212 \u03c6(x, y_w(x))] = lim_{\u03b5\u2192+0} (1/\u03b5) E[(\u2206w)\u22a4\u2206\u03c6(x) \u00b7 \u2206L(y) \u00b7 1{w\u22a4\u2206\u03c6(x) \u2208 [0, \u03b5]}]    (12)\n\nTheorem 1 now follows from (10) and (12).\n\n3.1 The General Position Assumption\n\nThe general position assumption is needed to ensure that Lemma 1 can be applied in the proof of Theorem 1. As a general position assumption, it is sufficient, but not necessary, that w \u2260 0 and \u03c6(x, y) has a bounded density on R^d for each fixed value of y. It is also sufficient that the range of the feature map is a submanifold of R^d and \u03c6(x, y) has a bounded density relative to the surface of that submanifold, for each fixed value of y. More complex distributions and feature maps are also possible.\n\n4 Extensions: Approximate Inference and Latent Structure\n\nIn many applications the inference problem (1) is intractable. Most commonly we have some form of graphical model. In this case the score w\u22a4\u03c6(x, y) is defined as the negative energy of a Markov random field (MRF) where x and y are assignments of values to nodes of the field. Finding a lowest energy value for y in (1) in a general graphical model is NP-hard.\n\nA common approach to an intractable optimization problem is to define a convex relaxation of the objective function. In the case of graphical models this can be done by defining a relaxation of a marginal polytope [11]. The details of the relaxation are not important here. 
At a very abstract level the resulting approximate inference problem can be defined as follows where the set R is a relaxation of the set Y, and corresponds to the extreme points of the relaxed polytope.\n\nr_w(x) = argmax_{r \u2208 R} w\u22a4\u03c6(x, r)    (13)\n\nWe assume that for y \u2208 Y and r \u2208 R we can assign a loss L(y, r). In the case of a relaxation of the marginal polytope of a graphical model we can take L(y, r) to be the expectation over a random rounding of r to \u1ef9 of L(y, \u1ef9). For many loss functions, such as weighted Hamming loss, one can compute L(y, r) efficiently. The training problem is then defined by the following equation.\n\nw\u2217 = argmin_w E[L(y, r_w(x))]    (14)\n\nNote that (14) directly optimizes the performance of the approximate inference algorithm. The parameter setting optimizing approximate inference might be significantly different from the parameter setting optimizing the loss under exact inference.\n\nThe proof of Theorem 1 generalizes to (14) provided that R is a finite set, such as the set of vertices of a relaxation of the marginal polytope. So we immediately get the following generalization of Theorem 1.\n\n\u2207_w E_{(x,y)\u223c\u03c1}[L(y, r_w(x))] = lim_{\u03b5\u21920} (1/\u03b5) E[\u03c6(x, r_direct) \u2212 \u03c6(x, r_w(x))]\n\nwhere\n\nr_direct = argmax_{r\u0303 \u2208 R} w\u22a4\u03c6(x, r\u0303) + \u03b5 L(y, r\u0303)\n\nAnother possible extension involves hidden structure. In many applications it is useful to introduce hidden information into the inference optimization problem. For example, in machine translation we might want to construct parse trees for both the input and output sentences. 
In this case the inference equation can be written as follows where h is the hidden information.\n\ny_w(x) = argmax_{y \u2208 Y} max_{h \u2208 H} w\u22a4\u03c6(x, y, h)    (15)\n\nIn this case we can take the training problem to again be defined by (2) but where y_w(x) is defined by (15).\n\nLatent information can be handled by the equations of approximate inference but where R is reinterpreted as the set of pairs (y, h) with y \u2208 Y and h \u2208 H. In this case L(y, r) has the form L(y, (y\u2032, h)) which we can take to be equal to L(y, y\u2032).\n\n5 Experiments\n\nIn this section we present empirical results on the task of phoneme-to-speech alignment. Phoneme-to-speech alignment is used as a tool in developing speech recognition and text-to-speech systems. In the phoneme alignment problem each input x represents a speech utterance, and consists of a pair (s, p) of a sequence of acoustic feature vectors, s = (s_1, . . . , s_T), where s_t \u2208 R^d, 1 \u2264 t \u2264 T; and a sequence of phonemes p = (p_1, . . . , p_K), where p_k \u2208 P, 1 \u2264 k \u2264 K is a phoneme symbol and P is a finite set of phoneme symbols. The lengths K and T can be different for different inputs although typically we have T significantly larger than K. The goal is to generate an alignment between the two sequences in the input. Sometimes this task is called forced-alignment because one is forced to interpret the given acoustic signal as the given phoneme sequence. The output y is a sequence (y_1, . . . , y_K), where 1 \u2264 y_k \u2264 T is an integer giving the start frame in the acoustic sequence of the k-th phoneme in the phoneme sequence. Hence the k-th phoneme starts at frame y_k and ends at frame y_{k+1} \u2212 1.\n\nTable 1: Percentage of correctly positioned phoneme boundaries, given a predefined tolerance on the TIMIT corpus. Results are reported on the whole TIMIT test-set (1344 utterances).\n\nMethod | \u03c4-alignment accuracy [%], t \u2264 10ms | t \u2264 20ms | t \u2264 30ms | t \u2264 40ms | \u03c4-insensitive loss\nBrugnara et al. (1993) | 74.6 | 88.8 | 94.1 | 96.8 | -\nKeshet (2007) | 80.0 | 92.3 | 96.4 | 98.2 | -\nHosom (2009) | 79.30 | 93.36 | 96.74 | 98.22 |\nDirect loss min. (trained \u03c4-alignment) | 86.01 | 94.08 | 97.08 | 98.44 | 0.278\nDirect loss min. (trained \u03c4-insensitive) | 85.72 | 94.21 | 97.21 | 98.60 | 0.277\n\nTwo types of loss functions are used to quantitatively assess alignments. The first loss is called the \u03c4-alignment loss and it is defined as\n\nL_{\u03c4-alignment}(\u0233, \u0233\u2032) = (1/|\u0233|) |{k : |y_k \u2212 y\u2032_k| > \u03c4}|    (16)\n\nIn words, this loss measures the average number of times the absolute difference between the predicted alignment sequence and the manual alignment sequence is greater than \u03c4. This loss with different values of \u03c4 was used to measure the performance of the learned alignment function in [1, 9, 4]. The second loss, called \u03c4-insensitive loss, was proposed in [5] and is defined as follows.\n\nL_{\u03c4-insensitive}(\u0233, \u0233\u2032) = (1/|\u0233|) \u2211_k max{|y_k \u2212 y\u2032_k| \u2212 \u03c4, 0}    (17)\n\nThis loss measures the average disagreement between all the boundaries of the desired alignment sequence and the boundaries of the predicted alignment sequence where a disagreement of less than \u03c4 is ignored. Note that \u03c4-insensitive loss is continuous and convex while \u03c4-alignment is discontinuous and non-convex. Rather than use the \u201caway-from-worse\u201d update given by (6) we use the \u201ctoward-better\u201d update defined as follows. 
\n\nw_{t+1} = w_t + \u03b7_t (\u03c6(x\u0304_t, \u0233^t_direct) \u2212 \u03c6(x\u0304_t, \u0233_{w_t}(x\u0304_t)))\n\n\u0233^t_direct = argmax_{\u1ef9 \u2208 Y} (w_t)\u22a4\u03c6(x\u0304_t, \u1ef9) \u2212 \u03b5_t L(\u0233_t, \u1ef9)\n\nBoth updates give the gradient direction in the limit of small \u03b5 but the toward-better version seems to perform better for finite \u03b5.\n\nOur experiments are on the TIMIT speech corpus for which there are published benchmark results [1, 5, 4]. The corpus contains aligned utterances each of which is a pair (x, y) where x is a pair of a phonetic sequence and an acoustic sequence and y is a desired alignment. We divided the training portion of TIMIT (excluding the SA1 and SA2 utterances) into three disjoint parts containing 1500, 1796, and 100 utterances, respectively. The first part of the training set was used to train a phoneme frame-based classifier, which given a speech frame and a phoneme, outputs the confidence that the phoneme was uttered in that frame. The phoneme frame-based classifier is then used as part of a seven dimensional feature map \u03c6(x, y) = \u03c6((s\u0304, p\u0304), \u0233) as described in [5]. The feature set used to train the phoneme classifier consisted of the Mel-Frequency Cepstral Coefficients (MFCC) and the log-energy along with their first and second derivatives (\u2206+\u2206\u2206) as described in [5]. The classifier used a Gaussian kernel with \u03c3\u00b2 = 19 and a trade-off parameter C = 5.0. The complete set of 61 TIMIT phoneme symbols was mapped into 39 phoneme symbols as proposed by [6], and this reduced set was used throughout the training process.\n\nThe seven dimensional weight vector w was trained on the second set of 1796 aligned utterances. We trained twice, once for \u03c4-alignment loss and once for \u03c4-insensitive loss, with \u03c4 = 10 ms in both cases. 
Training was done by first setting w_0 = 0 and then repeatedly selecting one of the 1796 training pairs at random and performing the update (6) with \u03b7_t = 1 and \u03b5_t set to a fixed value \u03b5. It should be noted that if w_0 = 0 and \u03b5_t and \u03b7_t are both held constant at \u03b5 and \u03b7 respectively, then the direction of w_t is independent of the choice of \u03b7. These updates are repeated until the performance of w_t on the third data set (the hold-out set) begins to degrade. This gives a form of regularization known as early stopping. This was repeated for various values of \u03b5 and a value of \u03b5 was selected based on the resulting performance on the 100 hold-out pairs. We selected \u03b5 = 1.1 for both loss functions.\n\nWe scored the performance of our system on the whole TIMIT test set of 1344 utterances using \u03c4-alignment accuracy (one minus the loss) with \u03c4 set to each of 10, 20, 30 and 40 ms and with \u03c4-insensitive loss with \u03c4 set to 10 ms. As should be expected, for \u03c4 equal to 10 ms the best performance is achieved when the loss used in training matches the loss used in test. Larger values of \u03c4 correspond to a loss function that was not used in training. The results are given in Table 1. We compared our results with [4], which is an HMM/ANN-based system, and with [5], which is based on structural SVM training for \u03c4-insensitive loss. Both systems are considered to be state of the art on this corpus. As can be seen, our algorithm outperforms the current state-of-the-art results at every tolerance value. Also, as might be expected, the \u03c4-insensitive loss seems more robust to the use of a \u03c4 value at test time that is larger than the \u03c4 value used in training.\n\n6 Open Problems and Discussion\n\nThe main result of this paper is the loss gradient theorem of Section 3. This theorem provides a theoretical foundation for perceptron-like training methods with updates computed as a difference between the feature vectors of two different inferred outputs where at least one of those outputs is inferred with loss-adjusted inference. Perceptron-like training methods using feature differences between two inferred outputs have already been shown to be successful for machine translation but theoretical justification has been lacking. We also show the value of these training methods in a phonetic alignment problem.\n\nAlthough we did not give an asymptotic convergence result it should be straightforward to show that under the update given by (6) we have that w_t converges to a local optimum of the objective provided that both \u03b7_t and \u03b5_t go to zero while \u2211_t \u03b7_t \u03b5_t goes to infinity. For example one could take \u03b7_t = \u03b5_t = 1/\u221at.\n\nAn open problem is how to properly incorporate regularization in the case where only a finite corpus of training data is available. In our phoneme alignment experiments we trained only a seven dimensional weight vector and early stopping was used as regularization. It should be noted that naive regularization with a norm of w, such as regularizing with \u03bb||w||\u00b2, is nonsensical as the loss E[L(y, y_w(x))] is insensitive to the norm of w. Regularization is typically done with a surrogate loss function such as hinge loss. Regularization remains an open theoretical issue for direct gradient descent on a desired loss function on a finite training sample. Early stopping may be a viable approach in practice.\n\nMany practical computational problems in areas such as computational linguistics, computer vision, speech recognition, robotics, genomics, and marketing seem best handled by some form of score optimization. In all such applications we have two optimization problems. Inference is an optimization problem (approximately) solved during the operation of the fielded software system. 
Training involves optimizing the parameters of the scoring function to achieve good performance of the fielded system. We have provided a theoretical foundation for a certain perceptron-like training algorithm by showing that it can be viewed as direct stochastic gradient descent on the loss of the inference system. The main point of this training method is to incorporate domain-specific loss functions, such as the BLEU score in machine translation, directly into the training process with a clear theoretical foundation. Hopefully the theoretical framework provided here will prove helpful in the continued development of improved training methods.\n\nReferences\n\n[1] F. Brugnara, D. Falavigna, and M. Omologo. Automatic segmentation and labeling of speech based on hidden markov models. Speech Communication, 12:357\u2013370, 1993.\n\n[2] D. Chiang, K. Knight, and W. Wang. 11,001 new features for statistical machine translation. In Proc. NAACL, 2009.\n\n[3] M. Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.\n\n[4] J.-P. Hosom. Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 51:352\u2013368, 2009.\n\n[5] J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan. A large margin algorithm for speech and audio segmentation. IEEE Trans. on Audio, Speech and Language Processing, Nov. 2007.\n\n[6] K.-F. Lee and H.-W. Hon. Speaker independent phone recognition using hidden markov models. IEEE Trans. Acoustic, Speech and Signal Proc., 37(2):1641\u20131648, 1989.\n\n[7] P. Liang, A. Bouchard-C\u00f4t\u00e9, D. Klein, and B. Taskar. An end-to-end discriminative approach to machine translation. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), 2006.\n\n[8] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in Neural Information Processing Systems 17, 2003.\n\n[9] D.T. Toledano, L.A.H. Gomez, and L.V. Grande. Automatic phoneme segmentation. IEEE Trans. Speech and Audio Proc., 11(6):617\u2013625, 2003.\n\n[10] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453\u20131484, 2005.\n\n[11] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, December 2008.", "award": [], "sourceid": 866, "authors": [{"given_name": "Tamir", "family_name": "Hazan", "institution": null}, {"given_name": "Joseph", "family_name": "Keshet", "institution": null}, {"given_name": "David", "family_name": "McAllester", "institution": null}]}