{"title": "Structured Prediction via the Extragradient Method", "book": "Advances in Neural Information Processing Systems", "page_first": 1345, "page_last": 1352, "abstract": "", "full_text": "Structured Prediction via the Extragradient\n\nMethod\n\nBen Taskar\n\nComputer Science\n\nUC Berkeley, Berkeley, CA 94720\ntaskar@cs.berkeley.edu\n\nSimon Lacoste-Julien\n\nComputer Science\n\nUC Berkeley, Berkeley, CA 94720\nslacoste@cs.berkeley.edu\n\nMichael I. Jordan\n\nComputer Science and Statistics\nUC Berkeley, Berkeley, CA 94720\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present a simple and scalable algorithm for large-margin estima-\ntion of structured models, including an important class of Markov net-\nworks and combinatorial models. We formulate the estimation problem\nas a convex-concave saddle-point problem and apply the extragradient\nmethod, yielding an algorithm with linear convergence using simple gra-\ndient and projection calculations. The projection step can be solved us-\ning combinatorial algorithms for min-cost quadratic \ufb02ow. This makes the\napproach an ef\ufb01cient alternative to formulations based on reductions to\na quadratic program (QP). We present experiments on two very different\nstructured prediction tasks: 3D image segmentation and word alignment,\nillustrating the favorable scaling properties of our algorithm.\n\nIntroduction\n\n1\nThe scope of discriminative learning methods has been expanding to encompass prediction\ntasks with increasingly complex structure. Much of this recent development builds upon\ngraphical models to capture sequential, spatial, recursive or relational structure, but as we\nwill discuss in this paper, the structured prediction problem is broader still. For graphical\nmodels, two major approaches to discriminative estimation have been explored: (1) maxi-\nmum conditional likelihood [13] and (2) maximum margin [6, 1, 20]. 
For the broader class of models that we consider here, the conditional likelihood approach is intractable, but the large margin formulation yields tractable convex problems.

We interpret the term structured output model very broadly, as a compact scoring scheme over a (possibly very large) set of combinatorial structures and a method for finding the highest scoring structure. In graphical models, the scoring scheme is embodied in a probability distribution over possible assignments of the prediction variables as a function of input variables. In models based on combinatorial problems, the scoring scheme is usually a simple sum of weights associated with vertices, edges, or other components of a structure; these weights are often represented as parametric functions of a set of features. Given training instances labeled by desired structured outputs (e.g., matchings) and a set of features that parameterize the scoring function, the learning problem is to find parameters such that the highest scoring outputs are as close as possible to the desired outputs.

Examples of prediction tasks solved via combinatorial optimization include bipartite and non-bipartite matching in alignment of 2D shapes [5], word alignment in natural language translation [14] and disulfide connectivity prediction for proteins [3]. All of these problems can be formulated in terms of a tractable optimization problem. There are also interesting subfamilies of graphical models for which large-margin methods are tractable whereas likelihood-based methods are not; an example is the class of Markov random fields with restricted potentials used for object segmentation in vision [12, 2].

Tractability is not necessarily sufficient to obtain algorithms that work effectively in practice.
In particular, although the problem of large margin estimation can be formulated as a\nquadratic program (QP) in several cases of interest [2, 19], and although this formulation\nexploits enough of the problem structure so as to achieve a polynomial representation in\nterms of the number of variables and constraints, off-the-shelf QP solvers scale poorly with\nproblem and training sample size for these models. To solve large-scale machine learning\nproblems, researchers often turn to simple gradient-based algorithms, in which each indi-\nvidual step is cheap in terms of computation and memory. Examples of this approach in the\nstructured prediction setting include the Structured Sequential Minimal Optimization algo-\nrithm [20, 18] and the Structured Exponentiated Gradient algorithm [4]. These algorithms\nare \ufb01rst-order methods for solving QPs arising from low-treewidth Markov random \ufb01elds\nand other decomposable models. They are able to scale to signi\ufb01cantly larger problems\nthan off-the-shelf QP solvers. However, they are limited in scope in that they rely on dy-\nnamic programming to compute essential quantities such as gradients. They do not extend\nto models in which dynamic programming is not applicable, for example, to problems such\nas matchings and min-cuts.\n\nIn this paper, we present an estimation methodology for structured prediction problems\nthat does not require a general-purpose QP solver. We propose a saddle-point formulation\nwhich allows us to exploit simple gradient-based methods [11] with linear convergence\nguarantees. Moreover, we show that the key computational step in these methods\u2014a cer-\ntain projection operation\u2014inherits the favorable computational complexity of the underly-\ning optimization problem. This important result makes our approach viable computation-\nally. 
In particular, for matchings and min-cuts, projection involves a min-cost quadratic flow computation, a problem for which efficient, highly-specialized algorithms are available. We illustrate the effectiveness of this approach on two very different large-scale structured prediction tasks: 3D image segmentation and word alignment in translation.

2 Structured models

We begin by discussing two special cases of the general framework that we subsequently present: (1) a class of Markov networks used for segmentation, and (2) a bipartite matching model for word alignment. Despite significant differences in the setup for these models, they share the property that in both cases the problem of finding the highest-scoring output can be formulated as a linear program (LP).

Markov networks. We consider a special class of Markov networks, common in vision applications, in which inference reduces to a tractable min-cut problem [7]. Focusing on binary variables, $y = \{y_1, \ldots, y_N\}$, and pairwise potentials, we define a joint distribution over $\{0,1\}^N$ via $P(y) \propto \prod_{j \in V} \phi_j(y_j) \prod_{jk \in E} \phi_{jk}(y_j, y_k)$, where $(V, E)$ is an undirected graph, and where $\{\phi_j(y_j), j \in V\}$ are the node potentials and $\{\phi_{jk}(y_j, y_k), jk \in E\}$ are the edge potentials.

In image segmentation (see Fig. 1(a)), the node potentials capture local evidence about the label of a pixel or laser scan point. Edges usually connect nearby pixels in an image, and serve to correlate their labels.
Figure 1: Examples of structured prediction applications: (a) articulated object segmentation and (b) word alignment in machine translation.

Assuming that such correlations tend to be positive (connected nodes tend to have the same label), we restrict the form of edge potentials to be $\phi_{jk}(y_j, y_k) = \exp\{-s_{jk}\,\mathbb{1}(y_j \neq y_k)\}$, where $s_{jk}$ is a non-negative penalty for assigning $y_j$ and $y_k$ different labels. Expressing node potentials as $\phi_j(y_j) = \exp\{s_j y_j\}$, we have $P(y) \propto \exp\{\sum_{j \in V} s_j y_j - \sum_{jk \in E} s_{jk}\,\mathbb{1}(y_j \neq y_k)\}$. Under this restriction of the potentials, it is known that the problem of computing the maximizing assignment, $y^* = \arg\max P(y \mid x)$, has a tractable formulation as a min-cut problem [7]. In particular, we obtain the following LP:

$$\max_{0 \le z \le 1} \; \sum_{j \in V} s_j z_j - \sum_{jk \in E} s_{jk} z_{jk} \quad \text{s.t.} \quad z_j - z_k \le z_{jk}, \;\; z_k - z_j \le z_{jk}, \;\; \forall jk \in E. \tag{1}$$

In this LP, a continuous variable $z_j$ is a relaxation of the binary variable $y_j$. Note that the constraints are equivalent to $|z_j - z_k| \le z_{jk}$. Because $s_{jk}$ is positive, $z_{jk} = |z_k - z_j|$ at the maximum, which is equivalent to $\mathbb{1}(z_j \neq z_k)$ if the $z_j, z_k$ variables are binary. An integral optimal solution always exists, as the constraint matrix is totally unimodular [17] (that is, the relaxation is not an approximation).

We can parametrize the node and edge weights $s_j$ and $s_{jk}$ in terms of user-provided features $x_j$ and $x_{jk}$ associated with the nodes and edges.
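To make the LP in Eq. (1) concrete, here is a minimal sketch (our illustration, not from the paper; the node and edge scores are invented) that hands the min-cut relaxation to an off-the-shelf LP solver and recovers an integral labeling, as total unimodularity predicts:

```python
from scipy.optimize import linprog

# Toy instance of Eq. (1): two nodes joined by one edge.
# Invented scores: node 1 prefers label 1, node 2 prefers label 0.
s1, s2 = 2.0, -1.0
s12 = 0.5          # non-negative disagreement penalty on the edge

# Variables [z1, z2, z12]; linprog minimizes, so negate the objective
# max s1*z1 + s2*z2 - s12*z12.
c = [-s1, -s2, s12]

# z1 - z2 <= z12 and z2 - z1 <= z12, i.e. |z1 - z2| <= z12.
A_ub = [[1, -1, -1],
        [-1, 1, -1]]
b_ub = [0, 0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 3, method="highs")
z1, z2, z12 = res.x
print(z1, z2, z12)  # integral: z1 = 1, z2 = 0, and the edge pays z12 = 1
```

The relaxation labels the two nodes differently and pays the separation penalty $s_{12} = 0.5$ (objective value 1.5), which beats flipping node 2 against its local evidence.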
In particular, in 3D range data, $x_j$ might be spin image features or spatial occupancy histograms of a point $j$, while $x_{jk}$ might include the distance between points $j$ and $k$, the dot-product of their normals, etc. The simplest model of dependence is a linear combination of features: $s_j = w_n^\top f_n(x_j)$ and $s_{jk} = w_e^\top f_e(x_{jk})$, where $w_n$ and $w_e$ are node and edge parameters, and $f_n$ and $f_e$ are node and edge feature mappings, of dimension $d_n$ and $d_e$, respectively. To ensure non-negativity of $s_{jk}$, we assume the edge features $f_e$ to be non-negative and restrict $w_e \ge 0$. This constraint is easily incorporated into the formulation we present below. We assume that the feature mappings $f$ are provided by the user and our goal is to estimate parameters $w$ from labeled data. We abbreviate the score assigned to a labeling $y$ for an input $x$ as $w^\top f(x, y) = \sum_{j} y_j\, w_n^\top f_n(x_j) - \sum_{jk \in E} y_{jk}\, w_e^\top f_e(x_{jk})$, where $y_{jk} = \mathbb{1}(y_j \neq y_k)$.

Matchings. Consider modeling the task of word alignment of parallel bilingual sentences (see Fig. 1(b)) as a maximum weight bipartite matching problem, where the nodes $V = V^s \cup V^t$ correspond to the words in the "source" sentence ($V^s$) and the "target" sentence ($V^t$) and the edges $E = \{jk : j \in V^s, k \in V^t\}$ correspond to possible alignments between them. For simplicity, assume that each word aligns to one or zero words in the other sentence. The edge weight $s_{jk}$ represents the degree to which word $j$ in one sentence can translate into the word $k$ in the other sentence. Our objective is to find an alignment that maximizes the sum of edge scores. We represent a matching using a set of binary variables $y_{jk}$ that are set to 1 if word $j$ is assigned to word $k$ in the other sentence, and 0 otherwise. The score of an assignment is the sum of edge scores: $s(y) = \sum_{jk \in E} s_{jk} y_{jk}$. The maximum weight bipartite matching problem, $\arg\max_{y \in \mathcal{Y}} s(y)$, can be found by solving the following LP:

$$\max_{0 \le z \le 1} \; \sum_{jk \in E} s_{jk} z_{jk} \quad \text{s.t.} \quad \sum_{j \in V^s} z_{jk} \le 1, \; \forall k \in V^t; \quad \sum_{k \in V^t} z_{jk} \le 1, \; \forall j \in V^s, \tag{2}$$

where again the continuous variables $z_{jk}$ correspond to the relaxation of the binary variables $y_{jk}$. As in the min-cut problem, this LP is guaranteed to have integral solutions for any scoring function $s(y)$ [17].

For word alignment, the scores $s_{jk}$ can be defined in terms of the word pair $jk$ and input features associated with $x_{jk}$.
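As a sanity check on the matching LP in Eq. (2) and its claimed integrality, the sketch below (toy scores, our own illustration) solves the relaxed LP and confirms that a combinatorial assignment solver finds the same matching:

```python
import numpy as np
from scipy.optimize import linprog, linear_sum_assignment

# Invented 2x2 alignment scores s_jk for a two-word source/target pair.
s = np.array([[0.8, 0.2],
              [0.3, 0.9]])

# Variables z = (z11, z12, z21, z22), row-major; maximize sum s_jk * z_jk.
c = -s.ravel()
# Each source word aligns to at most one target word (row sums <= 1),
# and each target word to at most one source word (column sums <= 1).
A_ub = [[1, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1]]
b_ub = [1, 1, 1, 1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 4, method="highs")
z = res.x.reshape(2, 2)
print(np.round(z, 6))  # integral solution: aligns word 1-1 and word 2-2

# The combinatorial solver agrees with the LP relaxation.
rows, cols = linear_sum_assignment(-s)
print(rows.tolist(), cols.tolist())  # [0, 1] [0, 1]
```

The LP vertex is integral (here the identity matching, with total score 1.7), exactly as the total unimodularity of the matching constraints guarantees.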
We can include the identity of the two words, relative position in the respective sentences, part-of-speech tags, string similarity (for detecting cognates), etc. We let $s_{jk} = w^\top f(x_{jk})$ for some user-provided feature mapping $f$ and abbreviate $w^\top f(x, y) = \sum_{jk} y_{jk}\, w^\top f(x_{jk})$.

General structure. More generally, we consider prediction problems in which the input $x \in \mathcal{X}$ is an arbitrary structured object and the output is a vector of values $y = (y_1, \ldots, y_{L_x})$, for example, a matching or a cut in the graph. We assume that the length $L_x$ and the structure of $y$ depend deterministically on the input $x$. In our word alignment example, the output space is defined by the length of the two sentences. Denote the output space for a given input $x$ as $\mathcal{Y}(x)$ and the entire output space as $\mathcal{Y} = \bigcup_{x \in \mathcal{X}} \mathcal{Y}(x)$.

Consider the class of structured prediction models $\mathcal{H}$ defined by the linear family: $h_w(x) = \arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)$, where $f(x, y)$ is a vector of functions $f : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}^n$. This formulation is very general. Indeed, it is too general for our purposes; for many $f, \mathcal{Y}$ pairs, finding the optimal $y$ is intractable. Below, we specialize to the class of models in which the arg max problem can be solved in polynomial time using linear programming (and more generally, convex optimization); this is still a very large class of models.

3 Max-margin estimation

We assume a set of training instances $S = \{(x_i, y_i)\}_{i=1}^m$, where each instance consists of a structured object $x_i$ (such as a graph) and a target solution $y_i$ (such as a matching). Consider learning the parameters $w$ in the conditional likelihood setting. We can define $P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\{w^\top f(x, y)\}$, where $Z_w(x) = \sum_{y' \in \mathcal{Y}(x)} \exp\{w^\top f(x, y')\}$, and maximize the conditional log-likelihood $\sum_i \log P_w(y_i \mid x_i)$, perhaps with additional regularization of the parameters $w$.
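To see why $Z_w(x)$ is troublesome for matchings, the brute-force sketch below (invented scores, our own illustration) enumerates every matching of a tiny 3 × 3 bipartite graph. Even partial matchings of $K_{3,3}$ number 34, the count grows super-exponentially with sentence length, and counting perfect matchings alone amounts to computing a matrix permanent:

```python
import itertools
import math

# Invented 3x3 alignment score matrix s[j][k].
s = [[0.5, -0.2, 0.1],
     [0.3, 0.7, -0.4],
     [-0.1, 0.2, 0.6]]
edges = [(j, k) for j in range(3) for k in range(3)]

def is_matching(subset):
    """A matching uses each source and target word at most once."""
    rows = [j for j, _ in subset]
    cols = [k for _, k in subset]
    return len(set(rows)) == len(rows) and len(set(cols)) == len(cols)

# Z_w(x) = sum over all matchings y of exp(score(y)), by enumeration.
Z = 0.0
count = 0
for r in range(len(edges) + 1):
    for subset in itertools.combinations(edges, r):
        if is_matching(subset):
            count += 1
            Z += math.exp(sum(s[j][k] for j, k in subset))
print(count, Z)  # 34 matchings (including the empty one)
```

Enumeration is fine at this scale, but the number of terms explodes for real sentences, which is why the likelihood approach is abandoned in favor of the max-margin formulation below.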
However, computing the partition function $Z_w(x)$ is #P-complete [23, 10] for the two structured prediction problems we presented above, matchings and min-cuts. Instead, we adopt the max-margin formulation of [20], which directly seeks to find parameters $w$ such that $y_i = \arg\max_{y'_i \in \mathcal{Y}_i} w^\top f(x_i, y'_i), \; \forall i$, where $\mathcal{Y}_i = \mathcal{Y}(x_i)$ and $y_i$ denotes the appropriate vector of variables for example $i$. The solution space $\mathcal{Y}_i$ depends on the structured object $x_i$; for example, the space of possible matchings depends on the precise set of nodes and edges in the graph.

As in univariate prediction, we measure the error of prediction using a loss function $\ell(y_i, y'_i)$. To obtain a convex formulation, we upper bound the loss $\ell(y_i, h_w(x_i))$ using the hinge function: $\max_{y'_i \in \mathcal{Y}_i} [w^\top f_i(y'_i) + \ell_i(y'_i)] - w^\top f_i(y_i)$, where $\ell_i(y'_i) = \ell(y_i, y'_i)$ and $f_i(y'_i) = f(x_i, y'_i)$. Minimizing this upper bound will force the true structure $y_i$ to be optimal with respect to $w$ for each instance $i$. We add a standard $L_2$ weight penalty $\frac{||w||^2}{2C}$:

$$\min_{w \in \mathcal{W}} \; \frac{||w||^2}{2C} + \sum_i \left( \max_{y'_i \in \mathcal{Y}_i} \left[ w^\top f_i(y'_i) + \ell_i(y'_i) \right] - w^\top f_i(y_i) \right), \tag{3}$$

where $C$ is a regularization parameter and $\mathcal{W}$ is the space of allowed weights (for example, $\mathcal{W} = \mathbb{R}^n$ or $\mathcal{W} = \mathbb{R}^n_+$). Note that this formulation is equivalent to the standard formulation using slack variables $\xi$ and slack penalty $C$ presented in [20, 19].

The key to solving Eq. (3) efficiently is the loss-augmented inference problem, $\max_{y'_i \in \mathcal{Y}_i} [w^\top f_i(y'_i) + \ell_i(y'_i)]$. This optimization problem has precisely the same form as the prediction problem whose parameters we are trying to learn, $\max_{y'_i \in \mathcal{Y}_i} w^\top f_i(y'_i)$, but with an additional term corresponding to the loss function. Tractability of the loss-augmented inference thus depends not only on the tractability of $\max_{y'_i \in \mathcal{Y}_i} w^\top f_i(y'_i)$, but also on the form of the loss term $\ell_i(y'_i)$. A natural choice in this regard is the Hamming distance, which simply counts the number of variables in which a candidate solution $y'_i$ differs from the target output $y_i$. In general, we need only assume that the loss function decomposes over the variables in $y_i$.

For example, in the case of bipartite matchings the Hamming loss counts the number of different edges in the matchings $y_i$ and $y'_i$ and can be written as $\ell^H_i(y'_i) = \sum_{jk} y_{i,jk} + \sum_{jk} (1 - 2y_{i,jk})\, y'_{i,jk}$. Thus the loss-augmented matching problem for example $i$ can be written as an LP similar to Eq. (2) (without the constant term $\sum_{jk} y_{i,jk}$):

$$\max_{0 \le z \le 1} \; \sum_{jk} z_{i,jk} \left[ w^\top f(x_{i,jk}) + 1 - 2y_{i,jk} \right] \quad \text{s.t.} \quad \sum_j z_{i,jk} \le 1, \;\; \sum_k z_{i,jk} \le 1.$$

Generally, when we can express $\max_{y'_i \in \mathcal{Y}_i} w^\top f_i(y'_i)$ as an LP, $\max_{z_i \in \mathcal{Z}_i} w^\top F_i z_i$, where $\mathcal{Z}_i = \{z_i : A_i z_i \le b_i, \; z_i \ge 0\}$, for appropriately defined constraints $A_i, b_i$ and feature matrix $F_i$, we have a similar LP for the loss-augmented inference for each example $i$: $d_i + \max_{z_i \in \mathcal{Z}_i} (F_i^\top w + c_i)^\top z_i$ for appropriately defined $d_i, F_i, c_i, A_i, b_i$. Let $z = \{z_1, \ldots, z_m\}$ and $\mathcal{Z} = \mathcal{Z}_1 \times \ldots \times \mathcal{Z}_m$.

We could proceed by making use of Lagrangian duality, which yields a joint convex optimization problem; this is the approach described in [19]. Instead we take a different tack here, posing the problem in its natural saddle-point form:

$$\min_{w \in \mathcal{W}} \max_{z \in \mathcal{Z}} \; \frac{||w||^2}{2C} + \sum_i \left[ w^\top F_i z_i + c_i^\top z_i - w^\top f_i(y_i) \right]. \tag{4}$$

As we discuss in the following section, this approach allows us to exploit the structure of $\mathcal{W}$ and $\mathcal{Z}$ separately, allowing for efficient solutions for a wider range of structure spaces.

4 Extragradient method

The key operations of the method we present below are gradient calculations and Euclidean projections.
We let $L(w, z) = \frac{||w||^2}{2C} + \sum_i \left[ w^\top F_i z_i + c_i^\top z_i - w^\top f_i(y_i) \right]$, with gradients given by $\nabla_w L(w, z) = \frac{w}{C} + \sum_i \left( F_i z_i - f_i(y_i) \right)$ and $\nabla_{z_i} L(w, z) = F_i^\top w + c_i$. We denote the projection of a vector $z'_i$ onto $\mathcal{Z}_i$ as $\pi_{\mathcal{Z}_i}(z'_i) = \arg\min_{z_i \in \mathcal{Z}_i} ||z'_i - z_i||$ and similarly, the projection onto $\mathcal{W}$ as $\pi_{\mathcal{W}}(w') = \arg\min_{w \in \mathcal{W}} ||w' - w||$.

A well-known solution strategy for saddle-point optimization is provided by the extragradient method [11]. An iteration of the extragradient method consists of two very simple steps, prediction $(w, z) \rightarrow (w^p, z^p)$ and correction $(w^p, z^p) \rightarrow (w^c, z^c)$:

$$w^p = \pi_{\mathcal{W}}(w - \beta \nabla_w L(w, z)); \qquad z^p_i = \pi_{\mathcal{Z}_i}(z_i + \beta \nabla_{z_i} L(w, z)); \tag{5}$$
$$w^c = \pi_{\mathcal{W}}(w - \beta \nabla_w L(w^p, z^p)); \qquad z^c_i = \pi_{\mathcal{Z}_i}(z_i + \beta \nabla_{z_i} L(w^p, z^p)); \tag{6}$$

where $\beta$ is an appropriately chosen step size. The algorithm starts with a feasible point $w = 0$, $z_i$'s that correspond to the assignments $y_i$'s, and step size $\beta = 1$. After each prediction step, it computes

$$r = \beta \, \frac{||\nabla L(w, z) - \nabla L(w^p, z^p)||}{||w - w^p|| + ||z - z^p||}.$$

If $r$ is greater than a threshold $\nu$, the step size is decreased using an Armijo-type rule, $\beta = (2/3)\beta \min(1, 1/r)$, and a new prediction step is computed until $r \le \nu$, where $\nu \in (0, 1)$ is a parameter of the algorithm. Once a suitable $\beta$ is found, the correction step is taken and $(w^c, z^c)$ becomes the new $(w, z)$. The method is guaranteed to converge linearly to a solution $w^*, z^*$ [11, 9]. See the longer version of this paper at http://www.cs.berkeley.edu/~taskar/extragradient.pdf for details. By comparison, Exponentiated Gradient [4] has sublinear convergence rate guarantees, while Structured SMO [18] has none.

The key step influencing the efficiency of the algorithm is the Euclidean projection onto the feasible sets $\mathcal{W}$ and $\mathcal{Z}_i$.
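The prediction-correction loop of Eqs. (5)-(6) is easy to sketch. The toy below is our own illustration, not the paper's implementation: a one-dimensional $w$ and $z$ with invented constants $C = 1$, $F = 1$, $f = 1$, $c = -0.5$, a fixed step size in place of the Armijo rule, $\mathcal{W} = \mathbb{R}$, and $\mathcal{Z} = [0, 1]$. For these values the saddle point of $L$ is $(w^*, z^*) = (0.5, 0.5)$:

```python
# Extragradient (Korpelevich) on a 1-D instance of the saddle-point
# objective in Eq. (4): L(w, z) = w^2/(2C) + w*F*z + c*z - w*f.
C, F, f, c = 1.0, 1.0, 1.0, -0.5
beta = 0.3                      # fixed step size (the paper adapts it)

def grad_w(w, z):
    return w / C + F * z - f    # dL/dw

def grad_z(w, z):
    return F * w + c            # dL/dz

def proj_Z(z):
    return min(1.0, max(0.0, z))  # Euclidean projection onto Z = [0, 1]

w, z = 0.0, 0.0                 # feasible starting point
for _ in range(1000):
    # Prediction step, Eq. (5): descend in w, ascend in z.
    wp = w - beta * grad_w(w, z)
    zp = proj_Z(z + beta * grad_z(w, z))
    # Correction step, Eq. (6): re-step from (w, z) with gradients at (wp, zp).
    w = w - beta * grad_w(wp, zp)
    z = proj_Z(z + beta * grad_z(wp, zp))

print(round(w, 4), round(z, 4))  # converges to the saddle point (0.5, 0.5)
```

Plain simultaneous gradient descent-ascent can spiral away from a saddle point; the extra "lookahead" gradient evaluation at $(w^p, z^p)$ is what buys the linear convergence cited above.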
In case $\mathcal{W} = \mathbb{R}^n$, the projection is the identity operation; projecting onto $\mathbb{R}^n_+$ consists of clipping negative weights to zero. Additional problem-specific constraints on the weight space can be efficiently incorporated in this step (although linear convergence guarantees only hold for polyhedral $\mathcal{W}$). In case of word alignment, $\mathcal{Z}_i$ is the convex hull of bipartite matchings and the problem reduces to the much-studied minimum cost quadratic flow problem. The projection $z_i = \pi_{\mathcal{Z}_i}(z'_i)$ is given by

$$\min_{0 \le z \le 1} \; \sum_{jk} \frac{1}{2} \left( z'_{i,jk} - z_{i,jk} \right)^2 \quad \text{s.t.} \quad \sum_j z_{i,jk} \le 1, \;\; \sum_k z_{i,jk} \le 1.$$

We use a standard reduction of bipartite matching to min-cost flow by introducing a source node $s$ linked to all the nodes in $V^s_i$ (words in the "source" sentence), and a sink node $t$ linked from all the nodes in $V^t_i$ (words in the "target" sentence), using edges of capacity 1 and cost 0. The original edges $jk$ have a quadratic cost $\frac{1}{2}(z'_{i,jk} - z_{i,jk})^2$ and capacity 1. Minimum (quadratic) cost flow from $s$ to $t$ is the projection of $z'_i$ onto $\mathcal{Z}_i$. The reduction of the projection to minimum quadratic cost flow for the min-cut polytope $\mathcal{Z}_i$ is shown in the longer version of the paper. Algorithms for solving this problem are nearly as efficient as those for solving regular min-cost flow problems. In case of word alignment, the running time scales with the cube of the sentence length. We use publicly-available code for solving this problem [8] (see http://www.math.washington.edu/~tseng/netflowg_nl/).

5 Experiments

We investigate the two structured models we described above: bipartite matchings for word alignments and restricted-potential Markov nets for 3D segmentation. A commercial QP solver, MOSEK, runs out of memory on the problems we describe below using the QP formulation [19].
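Before turning to the experimental details, we note that the projection onto the matching polytope described above can also be computed without specialized flow code. The sketch below is our own generic (and much slower) stand-in, not the paper's solver: Dykstra's alternating-projection method, cycling over the box constraint and the row/column sum half-spaces:

```python
import numpy as np

def _halfspace_row(z, j):
    # Project onto {sum_k z[j, k] <= 1}: shift row j down by violation / n.
    z = z.copy()
    excess = z[j].sum() - 1.0
    if excess > 0:
        z[j] -= excess / z.shape[1]
    return z

def _halfspace_col(z, k):
    z = z.copy()
    excess = z[:, k].sum() - 1.0
    if excess > 0:
        z[:, k] -= excess / z.shape[0]
    return z

def project_matching_polytope(z0, cycles=200):
    """Euclidean projection of z0 onto {0 <= z <= 1, row sums <= 1,
    col sums <= 1} via Dykstra's alternating projections (an illustrative
    stand-in for the min-cost quadratic flow solver)."""
    m, n = z0.shape
    # One easy projection per constraint set.
    projections = [lambda z: np.clip(z, 0.0, 1.0)]
    for j in range(m):
        projections.append(lambda z, j=j: _halfspace_row(z, j))
    for k in range(n):
        projections.append(lambda z, k=k: _halfspace_col(z, k))

    x = z0.astype(float).copy()
    incr = [np.zeros_like(x) for _ in projections]  # Dykstra increments
    for _ in range(cycles):
        for i, proj in enumerate(projections):
            y = proj(x + incr[i])
            incr[i] = x + incr[i] - y
            x = y
    return x

# An infeasible point: every entry 2. By symmetry its projection puts 0.5
# everywhere (row and column sums then equal 1).
z = project_matching_polytope(np.full((2, 2), 2.0))
print(z)
```

Unlike simple cyclic projection, Dykstra's increment bookkeeping converges to the exact Euclidean projection onto the intersection; the flow-based projection the paper uses achieves the same result far more efficiently at scale.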
We compared the extragradient method with the averaged perceptron algorithm [6]. A question which arises in practice is how to choose the regularization parameter $C$. The typical approach is to run the algorithm for several values of the regularization parameter and pick the best model using a validation set. For the averaged perceptron, a standard method is to run the algorithm tracking its performance on a validation set, and selecting the model with best performance. We use the same training regime for the extragradient by running it with $C = 1$.

Object segmentation. We test our algorithm on a 3D scan segmentation problem using the class of Markov networks with potentials that were described above. The dataset is a challenging collection of cluttered scenes containing articulated wooden puppets [2]. It contains eleven different single-view scans of three puppets of varying sizes and positions, with clutter and occluding objects such as rope, sticks and rings. Each scan consists of around 7,000 points. Our goal was to segment the scenes into two classes: puppet and background. We use five of the scenes for our training data, three for validation and three for testing. Sample scans from the training and test set can be seen at http://www.cs.berkeley.edu/~taskar/3DSegment/. We computed spin images of size 10 × 5 bins at two different resolutions, then scaled the values and performed PCA to obtain 45 principal components, which comprised our node features.
We used the surface links output by the scanner as edges between points and for each edge only used a single feature, set to a constant value of 1 for all edges. This results in all edges having the same potential. The training data contains approximately 37,000 nodes and 88,000 edges. Training took about 4 hours for 600 iterations on a 2.80GHz Pentium 4 machine. Fig. 2(a) shows that the extragradient has a consistently lower error rate (about 3% for extragradient, 4% for averaged perceptron), using only slightly more expensive computations per iteration. Also shown is the corresponding decrease in the hinge-loss upper bound on the training data as the extragradient progresses.

Figure 2: Both plots show test error for the averaged perceptron and the extragradient (left y-axis) and training loss per node or edge for the extragradient (right y-axis) versus number of iterations for (a) the object segmentation task and (b) the word alignment task.

Word alignment. We also tested our learning algorithm on word-level alignment using a data set from the 2003 NAACL shared task [15], the English-French Hansards task.
This corpus consists of 1.1M automatically aligned sentences, and comes with a validation set of 39 sentence pairs and a test set of 447 sentences. The validation and test sentences have been hand-aligned and are marked with both sure and possible alignments. Using these alignments, alignment error rate (AER) is calculated as

$$\text{AER}(A, S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}.$$

Here, $A$ is a set of proposed index pairs, $S$ is the set of sure gold pairs, and $P$ is the set of possible gold pairs (where $S \subseteq P$).

We used the intersection of the predictions of the English-to-French and French-to-English IBM Model 4 alignments (using GIZA++ [16]) on the first 5000 sentence pairs from the 1.1M sentences. The number of edges for 5000 sentences was about 555,000. We tested on the 347 hand-aligned test examples, and used the validation set to select the stopping point. The features on the word pair $(e_j, f_k)$ include measures of association, orthography, relative position, predictions of generative models (see [22] for details). It took about 3 hours to perform 600 training iterations on the training data using a 2.8GHz Pentium 4 machine. Fig. 2(b) shows the extragradient performing slightly better (by about 0.5%) than the averaged perceptron.

6 Conclusion

We have presented a general solution strategy for large-scale structured prediction problems. We have shown that these problems can be formulated as saddle-point optimization problems, problems that are amenable to solution by the extragradient algorithm. Key to our approach is the recognition that the projection step in the extragradient algorithm can be solved by network flow algorithms. Network flow algorithms are among the most well-developed in the field of combinatorial optimization, and yield stable, efficient algorithmic platforms. We have exhibited the favorable scaling of this overall approach in two concrete, large-scale learning problems.
It is also important to note that the general approach extends to a much broader class of problems. In [21], we show how to apply this approach efficiently to other types of models, including general Markov networks and weighted context-free grammars, using Bregman projections.

Acknowledgments
We thank Paul Tseng for kindly answering our questions about his min-cost flow code. This work was funded by the DARPA CALO project (03-000219) and a Microsoft Research MICRO award (05-081). SLJ was also supported by an NSERC graduate scholarship.

References
[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proc. ICML, 2003.
[2] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng. Discriminative learning of Markov random fields for segmentation of 3D scan data. In CVPR, 2005.
[3] P. Baldi, J. Cheng, and A. Vullo. Large-scale prediction of disulphide bond connectivity. In Proc. NIPS, 2004.
[4] P. Bartlett, M. Collins, B. Taskar, and D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. In NIPS, 2004.
[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24, 2002.
[6] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP, 2002.
[7] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. J. R. Statist. Soc. B, 51, 1989.
[8] F. Guerriero and P. Tseng. Implementation and test of auction methods for solving generalized network flow problems with separable convex cost. Journal of Optimization Theory and Applications, 115(1):113–144, October 2002.
[9] B. S. He and L. Z. Liao. Improvements of some projection methods for monotone nonlinear variational inequalities.
JOTA, 112:111–128, 2002.
[10] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Comput., 22, 1993.
[11] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Ekonomika i Matematicheskie Metody, 12:747–756, 1976.
[12] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. In NIPS, 2003.
[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[14] E. Matusov, R. Zens, and H. Ney. Symmetric word alignments for statistical machine translation. In Proc. COLING, 2004.
[15] R. Mihalcea and T. Pedersen. An evaluation exercise for word alignment. In Proceedings of the HLT-NAACL 2003 Workshop, Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pages 1–6, Edmonton, Alberta, Canada, 2003.
[16] F. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 2003.
[17] A. Schrijver. Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2003.
[18] B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2004.
[19] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: a large margin approach. In ICML, 2005.
[20] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[21] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and Bregman projections. Technical report, UC Berkeley Statistics Department, 2005.
[22] B. Taskar, S. Lacoste-Julien, and D. Klein. A discriminative matching approach to word alignment. In EMNLP, 2005.
[23] L. G. Valiant. The complexity of computing the permanent.
Theoretical Computer Science,\n\n8:189\u2013201, 1979.\n\n\f", "award": [], "sourceid": 2794, "authors": [{"given_name": "Ben", "family_name": "Taskar", "institution": null}, {"given_name": "Simon", "family_name": "Lacoste-Julien", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}