{"title": "More data means less inference: A pseudo-max approach to structured learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2181, "page_last": 2189, "abstract": "The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference in this setting are intractable. Here we show that it is possible to circumvent this difficulty when the input distribution is rich enough via a method similar in spirit to pseudo-likelihood. We show how our new method achieves consistency, and illustrate empirically that it indeed performs as well as exact methods when sufficiently large training sets are used.", "full_text": "More data means less inference: A pseudo-max\n\napproach to structured learning\n\nDavid Sontag\n\nMicrosoft Research\n\nOfer Meshi\n\nHebrew University\n\nTommi Jaakkola\n\nCSAIL, MIT\n\nAmir Globerson\nHebrew University\n\nAbstract\n\nThe problem of learning to predict structured labels is of key importance in many\napplications. However, for general graph structure both learning and inference are\nintractable. Here we show that it is possible to circumvent this dif\ufb01culty when\nthe distribution of training examples is rich enough, via a method similar in spirit\nto pseudo-likelihood. We show that our new method achieves consistency, and\nillustrate empirically that it indeed approaches the performance of exact methods\nwhen suf\ufb01ciently large training sets are used.\n\nMany prediction problems in machine learning applications are structured prediction tasks. For\nexample, in protein folding we are given a protein sequence and the goal is to predict the protein\u2019s\nnative structure [14]. In parsing for natural language processing, we are given a sentence and the goal\nis to predict the most likely parse tree [2]. 
In these and many other applications, we can formalize the\nstructured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting\ny (e.g., structure, parse) according to y = arg max\u02c6y\u2208Y \u03b8 \u00b7 \u03c6(x, \u02c6y), where \u03c6(x, y) is a function that\nmaps any input and a candidate assignment to a feature vector, Y denotes the space of all possible\nassignments to the vector y, and \u03b8 is a weight vector to be learned.\nThis paper addresses the problem of learning structured prediction models from data. In particular,\ngiven a set of labeled examples {(xm, ym)}M\nm=1, our goal is to \ufb01nd a vector \u03b8 such that for each\nexample m, ym = arg maxy\u2208Y \u03b8 \u00b7 \u03c6(xm, y), i.e. one which separates the training data. For many\nstructured prediction models, maximization over Y is computationally intractable. This makes it\ndif\ufb01cult to apply previous algorithms for learning structured prediction models, such as structured\nperceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making\na prediction at every iteration (equivalent to repeatedly solving an integer linear program).\nGiven training data, we can consider the space of parameters \u0398 that separate the data. This space can\nbe de\ufb01ned by the intersection of a large number of linear inequalities. A recent approach to getting\naround the hardness of prediction is to use linear programming (LP) relaxations to approximate the\nmaximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger\nconstraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish\nitself also from possible fractional vertexes that arise due to the relaxation. The relaxations can\ntherefore be understood as optimizing over an inner bound of \u0398. This set may be empty even if\nthe training data is separable with exact inference [6]. 
Another obstacle to using LP relaxations for\nlearning is that solving the LPs can be very slow.\nIn this paper we ask whether it is possible to learn while avoiding inference altogether. We propose\na new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of\n\u0398. Learning involves optimizing over only a small number of constraints per data point, and thus\ncan be performed quickly, even for complex structured prediction models. We show that, if the data\navailable for learning is \u201cnice\u201d, this algorithm is consistent, i.e. it will \ufb01nd some \u03b8 \u2208 \u0398. This\nis an example of how having the right data can circumvent the hardness of learning for structured\nprediction.\nWe also investigate the limitations of the proposed method. We show that the problem of even\ndeciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no\neasier than prediction. Thus, we should not expect our algorithm, or any other polynomial time\nalgorithm, to always succeed at learning from an arbitrary \ufb01nite data set. To our knowledge, this is\nthe \ufb01rst result characterizing the hardness of exact learning for structured prediction.\nFinally, we show empirically that our algorithm allows us to successfully learn the parameters for\nboth multi-label prediction and protein side-chain placement. The performance of the algorithm\nimproves as more data becomes available, as our theoretical results anticipate.\n\n1 Pseudo-Max method\nWe consider the general structured prediction problem. The input space is denoted by X and the set\nof all possible assignments by Y. Each y \u2208 Y corresponds to n variables y1, . . . , yn, each with k\npossible states. 
The classi\ufb01er uses a (given) function \u03c6(x, y) : X \u00d7 Y \u2192 R^d and (learned) weights\n\u03b8 \u2208 R^d, and is de\ufb01ned as y(x; \u03b8) = arg max_{\u02c6y\u2208Y} f(\u02c6y; x, \u03b8), where f is the discriminant function\nf(y; x, \u03b8) = \u03b8 \u00b7 \u03c6(x, y). Our analysis will focus on functions \u03c6 whose scope is limited to small\nsets of the y_i variables, but for now we keep the discussion general.\nGiven a set of labeled examples {(x^m, y^m)}_{m=1}^{M}, the goal of the typical learning problem is to \ufb01nd\nweights \u03b8 that correctly classify the training examples. Consider \ufb01rst the separable case. De\ufb01ne the\nset of separating weight vectors,\n\n\u0398 = {\u03b8 | \u2200m, y \u2208 Y, f(y^m; x^m, \u03b8) \u2265 f(y; x^m, \u03b8) + e(y, y^m)},\n\nwhere e is a loss function (e.g., zero-one or Hamming) such that e(y^m, y^m) = 0 and e(y, y^m) > 0\nfor y \u2260 y^m, which serves to rule out the trivial solution \u03b8 = 0 (see footnote 1). The space \u0398 is de\ufb01ned by\nexponentially many constraints per example, one for each competing assignment.\nIn this work we consider a much simpler set of constraints where, for each example, we only consider\nthe competing assignments obtained by modifying a single label y_i, while \ufb01xing the other labels to\ntheir value at y^m. The pseudo-max set, which is an outer bound on \u0398, is given by\n\n\u0398ps = {\u03b8 | \u2200m, i, y_i, f(y^m; x^m, \u03b8) \u2265 f(y^m_{\u2212i}, y_i; x^m, \u03b8) + e(y_i, y^m_i)}.    (1)\n\nHere y^m_{\u2212i} denotes the label y^m without the assignment to y_i.\nWhen the data is not separable, \u0398 will be the empty set. Instead, we may choose to minimize the\nhinge loss, \u2113(\u03b8) = \u2211_m max_y [f(y; x^m, \u03b8) \u2212 f(y^m; x^m, \u03b8) + e(y, y^m)], which can be shown to\nbe an upper bound on the training error [13]. When the data is separable, min_\u03b8 \u2113(\u03b8) = 0. 
Note that\nregularization may be added to this objective.\nThe corresponding pseudo-max objective replaces the maximization over all of y with maximization\nover a single variable y_i while \ufb01xing the other labels to their value at y^m (see footnotes 2 and 3):\n\n\u2113ps(\u03b8) = \u2211_{m=1}^{M} \u2211_{i=1}^{n} max_{y_i} [f(y^m_{\u2212i}, y_i; x^m, \u03b8) \u2212 f(y^m; x^m, \u03b8) + e(y_i, y^m_i)].    (2)\n\nAnalogous to before, we have min_\u03b8 \u2113ps(\u03b8) = 0 if and only if \u03b8 \u2208 \u0398ps.\nThe objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood\nestimation of parameters of Markov random \ufb01elds (MRFs) [1]. The pseudo-likelihood estimate\nis provably consistent when the data generating distribution is an MRF of the same structure as used\nin the pseudo-likelihood objective. However, our setting is different since we only get to view the\nmaximizing assignment of the MRF rather than samples from it. Thus, a particular x will always be\npaired with the same y rather than samples y drawn from the conditional distribution p(y|x; \u03b8).\nThe pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5].\nIn the latter, the learning problem is solved by repeatedly looking for assignments that violate the\nseparability constraint (or its hinge version). Our constraints can be viewed as using a very small\nsubset of assignments for the set of candidate constraint violators. We also note that when exact\nmaximization over the discriminant function f(y; x, \u03b8) is hard, the standard cutting plane algorithm\ncannot be employed since it is infeasible to \ufb01nd a violated constraint. 
For the pseudo-max objective,\n\ufb01nding a constraint violation is simple and linear in the number of variables (see footnote 4).\nIt is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield\na consistent estimate of \u03b8, even in the separable case. However, as we show, consistency can be\nachieved under particular assumptions on the data generating distribution p(x).\n\nFootnote 1: An alternative formulation, which we use in the next section, is to break the symmetry by having part of\nthe input not be multiplied by any weight. 
This will also rule out the trivial solution \u03b8 = 0.\nFootnote 2: It is possible to use max_i instead of \u2211_i, and some of our consistency results will still hold.\nFootnote 3: The pseudo-max approach is markedly different from a learning method which predicts each label y_i\nindependently, since the objective considers all i simultaneously (both at learning and test time).\n\nFigure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by con\ufb01gurations y(x)\nfor some J\u2217 > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that\nhold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing con\ufb01gurations\ny = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component\nof the \u2113ps(J) function (see Eq. 9). The function is shown for different values of c1, the mean of the x1 variable.\n\n2 Consistency of the Pseudo-Max method\n\nIn this section we show that if the feature generating distribution p(x) satis\ufb01es particular assumptions,\nthen the pseudo-max approach yields a consistent estimate. 
In other words, if the training\ndata is of the form {(x^m, y(x^m; \u03b8\u2217))}_{m=1}^{M} for some true parameter vector \u03b8\u2217, then as M \u2192 \u221e the\nminimizer of the pseudo-max objective will converge to \u03b8\u2217 (up to equivalence transformations).\nThe section is organized as follows. First, we provide intuition for the consistency results by considering\na model with only two variables. Then, in Sec. 2.1, we show that any parameter \u03b8\u2217 can\nbe identi\ufb01ed to within arbitrary accuracy by choosing a particular training set (i.e., choice of x^m).\nThis in itself proves consistency, as long as there is a non-zero probability of sampling this set. In\nSec. 2.2 we give a more direct proof of consistency by using strict convexity arguments.\nFor ease of presentation, we shall work with a simpli\ufb01ed instance of the structured learning setting.\nWe focus on binary variables, y_i \u2208 {0, 1}, and consider discriminant functions corresponding to\nIsing models, a special case of pairwise MRFs (J denotes the vector of \u201cinteraction\u201d parameters):\n\nf(y; x, J) = \u2211_{ij\u2208E} J_ij y_i y_j + \u2211_i y_i x_i.    (3)\n\nThe singleton potential for variable y_i is y_i x_i and is not dependent on the model parameters. We\ncould have instead used J_i y_i x_i, which would be more standard. However, this would make the\nparameter vector J invariant to scaling, complicating the identi\ufb01ability analysis. In the consistency\nanalysis we will assume that the data is generated using a true parameter vector J\u2217. We will show\nthat as the data size goes to in\ufb01nity, minimization of \u2113ps(J) yields J\u2217.\nWe begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables,\ni.e. f(y; x, J) = J y1 y2 + y1 x1 + y2 x2. The purpose of the analysis is to demonstrate general\nprinciples for when pseudo-max constraints may succeed or fail. 
Assume that training samples are\ngenerated via y(x) = arg max_y f(y; x, J\u2217). We can partition the input space X into four regions,\n{x \u2208 X : y(x) = \u02c6y} for each of the four con\ufb01gurations \u02c6y, shown in Fig. 1 (left). The blue lines\noutline the exact decision boundaries of f(y; x, J\u2217), with the lines being given by the constraints\nin \u0398 that hold with equality. The red lines denote the pseudo-max constraints in \u0398ps that hold with\nequality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical.\nWe can identify J\u2217 by obtaining samples x = (x1, x2) that explore both sides of one of the decision\nboundaries that depends on J\u2217. The pseudo-max constraints will fail to identify J\u2217 if the samples\ndo not suf\ufb01ciently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0)\nand y = (1, 1). This can happen, for example, when the input samples are dependent, giving rise\nonly to the con\ufb01gurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision\nline J\u2217 + x1 + x2 = 0, pseudo-max can only tell that they respect J\u2217 + x1 \u2265 0 and J\u2217 + x2 \u2265 0\n(dashed red lines), or x1 \u2264 0 and x2 \u2264 0 for points labeled (0, 0).\n\nFootnote 4: The methods differ substantially in the non-separable setting where we minimize \u2113ps(\u03b8), using a slack\nvariable for every node and example, rather than just one slack variable per example as in \u2113(\u03b8).\n\n[Figure 1 plots. Left: the (x1, x2) plane with regions y = (0, 0), (1, 0), (0, 1), (1, 1) and boundary lines J\u2217 + x1 + x2 = 0, J\u2217 + x2 = 0, J\u2217 + x1 = 0, x1 = 0, x2 = 0. Right: g(J12) versus J for c1 = \u22121, 0, 1.]\n\nOnly constraints that depend on the parameter are effective for learning. 
For pseudo-max to be able\nto identify J\u2217, the input samples must be continuous, densely populating the two parameter-dependent\ndecision lines that pseudo-max can use. 
The two point sets in the \ufb01gure illustrate good and bad\ninput distributions for pseudo-max. The diagonal set would work well with the exact constraints but\nbadly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution\non the right, populating the J\u2217 + x2 = 0 decision line, would permit pseudo-max to identify J\u2217.\n\n2.1 Identi\ufb01ability of True Parameters\n\nIn this section, we show that it is possible to approximately identify the true model parameters, up to\nmodel equivalence, using the pseudo-max constraints and a carefully chosen linear number of data\npoints. Consider the learning problem for structured prediction de\ufb01ned on a \ufb01xed graph G = (V, E)\nwhere the parameters to be learned are pairwise potential functions \u03b8_ij(y_i, y_j) for ij \u2208 E and single\nnode \ufb01elds \u03b8_i(y_i) for i \u2208 V. We consider discriminant functions of the form\n\nf(y; x, \u03b8) = \u2211_{ij\u2208E} \u03b8_ij(y_i, y_j) + \u2211_i \u03b8_i(y_i) + \u2211_i x_i(y_i),    (4)\n\nwhere the input space X = R^{|V|k} speci\ufb01es the single node potentials. Without loss of generality, we\nremove the additional degrees of freedom in \u03b8 by restricting it to be in a canonical form: \u03b8 \u2208 \u0398can\nif for all edges \u03b8_ij(y_i, y_j) = 0 whenever y_i = 0 or y_j = 0, and if for all nodes, \u03b8_i(y_i) = 0 when\ny_i = 0. As a result, assuming the training set comes from a model in this class, and the input \ufb01elds\nx_i(y_i) exercise the discriminant function appropriately, we can hope to identify \u03b8\u2217 \u2208 \u0398can. 
Indeed,\nwe show that, for some data sets, the pseudo-max constraints are suf\ufb01cient to identify \u03b8\u2217.\nLet \u0398ps({y^m, x^m}) be the set of parameters that satisfy the pseudo-max classi\ufb01cation constraints\n\n\u0398ps({y^m, x^m}) = {\u03b8 | \u2200m, i, y_i \u2260 y^m_i, f(y^m; x^m, \u03b8) \u2265 f(y^m_{\u2212i}, y_i; x^m, \u03b8)}.    (5)\n\nFor simplicity we omit the margin losses e(y^m_i, y_i), since the input \ufb01elds x_i(y_i) already suf\ufb01ce to\nrule out the trivial solution \u03b8 = 0.\nProposition 2.1. For any \u03b8\u2217 \u2208 \u0398can, there is a set of 2|V|(k \u2212 1) + 2|E|(k \u2212 1)^2 examples,\n{x^m, y(x^m; \u03b8\u2217)}, such that any pseudo-max consistent \u03b8 \u2208 \u0398ps({y^m, x^m}) \u2229 \u0398can is arbitrarily\nclose to \u03b8\u2217.\nThe proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler\nbinary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical\nform since J_ij y_i y_j = 0 whenever y_i = 0 or y_j = 0. For any ij \u2208 E, we show how to choose\ntwo input examples x^1 and x^2 such that any J consistent with the pseudo-max constraints for these\ntwo examples will have J_ij \u2208 [J\u2217_ij \u2212 \u03b5, J\u2217_ij + \u03b5]. Repeating this for all of the edge parameters then\ngives the complete set of examples. The input examples we need for this will depend on J\u2217.\nFor the \ufb01rst example, we set the input \ufb01elds for all neighbors of i (except j) in such a way that\nwe force the corresponding labels to be zero. More formally, we set x^1_k < \u2212|N(k)| max_l |J\u2217_kl| for\nk \u2208 N(i)\\j, resulting in y^1_k = 0, where y^1 = y(x^1). In contrast, we set x^1_j to a large value, e.g.\nx^1_j > |N(j)| max_l |J\u2217_jl|, so that y^1_j = 1. Finally, for node i, we set x^1_i = \u2212J\u2217_ij + \u03b5 so as to obtain a\nslight preference for y^1_i = 1. 
All other input \ufb01elds can be set arbitrarily. 
As a result, the pseudo-max\nconstraints pertaining to node i are f(y^1; x^1, J) \u2265 f(y^1_{\u2212i}, y_i; x^1, J) for y_i = 0, 1. By taking into\naccount the label assignments for y^1_i and its neighbors, and by removing terms that are the same on\nboth sides of the equation, we get J_ij + x^1_i + x^1_j \u2265 J_ij y_i + y_i x^1_i + x^1_j, which, for y_i = 0, implies\nthat J_ij + x^1_i \u2265 0 or J_ij \u2212 J\u2217_ij + \u03b5 \u2265 0. The second example x^2 differs only in terms of the input\n\ufb01eld for i. In particular, we set x^2_i = \u2212J\u2217_ij \u2212 \u03b5 so that y^2_i = 0. This gives J_ij \u2264 J\u2217_ij + \u03b5, as desired.\n\n2.2 Consistency via Strict Convexity\n\nIn this section we prove the consistency of the pseudo-max approach by showing that it corresponds\nto minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x \u2208\nR^n (a simple example being a multi-variate Gaussian) and that J\u2217 is \ufb01nite. We use a discriminant\nfunction as in Eq. 3. Now, assume the input points x^m are distributed according to p(x) and that\ny^m are obtained via y^m = arg max_y f(y; x^m, J\u2217). We can write the \u2113ps(J) objective for \ufb01nite\ndata, and its limit when M \u2192 \u221e, compactly as:\n\n\u2113ps(J) = (1/M) \u2211_m \u2211_i max_{y_i} [(y_i \u2212 y^m_i)(x^m_i + \u2211_{k\u2208N(i)} J_ki y^m_k)]\n\u2192 \u2211_i \u222b p(x) max_{y_i} [(y_i \u2212 y_i(x))(x_i + \u2211_{k\u2208N(i)} J_ki y_k(x))] dx,    (6)\n\nwhere y_i(x) is the label of i for input x when using parameters J\u2217. 
Starting from the above,\nconsider the terms separately for each i. We partition the integral over x \u2208 R^n into exclusive\nregions according to the predicted labels of the neighbors of i (given x). De\ufb01ne S_ij = {x : y_j(x) =\n1 and y_k(x) = 0 for k \u2208 N(i)\\j}. Eq. 6 can then be written as\n\n\u2113ps(J) = \u2211_i [\u02c6g_i({J_ik}_{k\u2208N(i)}) + \u2211_{k\u2208N(i)} g_ik(J_ik)],    (7)\n\nwhere g_ik(J_ik) = \u222b_{x\u2208S_ik} p(x) max_{y_i} [(y_i \u2212 y_i(x))(x_i + J_ik)] dx and \u02c6g_i({J_ik}_{k\u2208N(i)}) contains all of\nthe remaining terms, i.e. where either zero or more than one neighbor is set to one. The function \u02c6g_i\nis convex in J since it is a sum of integrals over convex functions. We proceed to show that g_ik(J_ik)\nis strictly convex for all choices of i and k \u2208 N(i). This will show that \u2113ps(J) is strictly convex\nsince it is a sum over functions strictly convex in each one of the variables in J.\nFor all values x_i \u2208 (\u2212\u221e, \u221e) there is some x in S_ij. This is because for any \ufb01nite x_i and \ufb01nite J\u2217,\nthe other x_j\u2019s can be chosen so as to give the y con\ufb01guration corresponding to S_ij. Now, since p(x)\nhas full support, we have P(S_ij) > 0 and p(x) > 0 for any x in S_ij. As a result, this also holds for\nthe marginal p_i(x_i|S_ij) over x_i within S_ij. After some algebra, we obtain:\n\ng_ij(J_ij) = P(S_ij) \u222b_{\u2212\u221e}^{\u221e} p_i(x_i|S_ij) max[0, x_i + J_ij] dx_i \u2212 \u222b_{x\u2208S_ij} p(x) y_i(x)(x_i + J_ij) dx.\n\nThe integral over the y_i(x)(x_i + J_ij) expression just adds a linear term to g_ij(J_ij). 
The relevant\nremaining term is (for brevity we drop P(S_ij), a strictly positive constant, and the ij index):\n\nh(J) = \u222b_{\u2212\u221e}^{\u221e} p_i(x_i|S_ij) max[0, x_i + J] dx_i = \u222b_{\u2212\u221e}^{\u221e} p_i(x_i|S_ij) \u02c6h(x_i, J) dx_i,    (8)\n\nwhere we de\ufb01ne \u02c6h(x_i, J) = max[0, x_i + J]. Note that h(J) is convex since \u02c6h(x_i, J) is convex in J\nfor all x_i. We want to show that h(J) is strictly convex. Consider J\u2032 < J and \u03b1 \u2208 (0, 1) and de\ufb01ne\nthe interval I = [\u2212J, \u2212\u03b1J \u2212 (1 \u2212 \u03b1)J\u2032]. 
For x_i \u2208 I it holds that: \u03b1\u02c6h(x_i, J) + (1 \u2212 \u03b1)\u02c6h(x_i, J\u2032) >\n\u02c6h(x_i, \u03b1J + (1 \u2212 \u03b1)J\u2032) (since the \ufb01rst term is strictly positive and the rest are zero). For all other x,\nthis inequality holds but is not necessarily strict (since \u02c6h is always convex in J). We thus have after\nintegrating over x that \u03b1h(J) + (1 \u2212 \u03b1)h(J\u2032) > h(\u03b1J + (1 \u2212 \u03b1)J\u2032), implying h is strictly convex,\nas required. Note that we used the fact that p(x) has full support when integrating over I.\nThe function \u2113ps(J) is thus a sum of strictly convex functions in all its variables (namely g_ik(J_ik))\nplus other convex functions of J, hence strictly convex. We can now proceed to show consistency.\nBy strict convexity, the pseudo-max objective is minimized at a unique point J. Since we know\nthat \u2113ps(J\u2217) = 0 and zero is a lower bound on the value of \u2113ps(J), it follows that J\u2217 is the unique\nminimizer. Thus we have that as M \u2192 \u221e, the minimizer of the pseudo-max objective is the true\nparameter vector, and thus we have consistency.\nAs an example, consider the case of two variables y_1, y_2, with x_1 and x_2 distributed according to\nN(c_1, 1) and N(0, 1) respectively. Furthermore assume J\u2217_12 = 0. Then simple direct calculation yields:\n\ng(J_12) = ((c_1 + J_12)/\u221a(2\u03c0)) \u222b_{\u2212J_12\u2212c_1}^{\u2212c_1} e^{\u2212x^2/2} dx \u2212 (1/\u221a(2\u03c0)) e^{\u2212c_1^2/2} + (1/\u221a(2\u03c0)) e^{\u2212(J_12+c_1)^2/2},    (9)\n\nwhich is indeed a strictly convex function that is minimized at J = 0 (see Fig. 1 for an illustration).\n\n3 Hardness of Structured Learning\n\nMost structured prediction learning algorithms use some form of inference as a subroutine. However,\nthe corresponding prediction task is generally NP-hard. For example, maximizing the discriminant\nfunction de\ufb01ned in Eq. 
3 is equivalent to solving Max-Cut, which is known to be NP-hard. This\nraises the question of whether it is possible to bypass prediction during learning. Although prediction\nmay be intractable for arbitrary MRFs, what does this say about the dif\ufb01culty of learning with a\npolynomial number of data points? In this section, we show that the problem of deciding whether\nthere exists a parameter vector that separates the training data is NP-hard.\nPut in the context of the positive results in this paper, these hardness results show that, although in\nsome cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certi\ufb01cate\nof optimality. Put differently, although the pseudo-max constraints in the separable case always give\nan outer bound on \u0398 (and may even be a single point), \u0398 could be the empty set, and we would\nnever know the difference.\nTheorem 3.1. Given labeled examples {(x^m, y^m)}_{m=1}^{M} for a \ufb01xed but arbitrary graph G, it is\nNP-hard to decide whether there exist parameters \u03b8 such that \u2200m, y^m = arg max_y f(y; x^m, \u03b8).\nProof. Any parameters \u03b8 have an equivalent parameterization in canonical form (see\nSec. 2.1, also supplementary). Thus, the examples will be separable if and only if they are separable\nby some \u03b8 \u2208 \u0398can. We reduce from unweighted Max-Cut. The Max-Cut problem is to decide,\ngiven an undirected graph G, whether there exists a cut of at least K edges. Let G\u2032 be the same graph\nas G, with k = 3 states per variable. 
We construct a small set of examples where a parameter vector\nwill exist that separates the data if and only if there is no cut of K or more edges in G.\nLet \u03b8 be parameters in canonical form equivalent to \u03b8\u2032, where \u03b8\u2032_ij(y_i, y_j) = 1 if (y_i, y_j) \u2208 {(1, 2), (2, 1)}, 0 if\ny_i = y_j, and \u2212n^2 if (y_i, y_j) \u2208 {(1, 3), (2, 3), (3, 1), (3, 2)}. We \ufb01rst construct 4n + 8|E| examples,\nusing the technique described in Sec. 
2.1 (also supplementary material), which when restricted to\nthe space \u0398can, constrain the parameters to equal \u03b8. We then use one more example (x^m, y^m) where\ny^m = 3 (every node is in state 3) and, for all i, x^m_i(3) = (K \u2212 1)/n and x^m_i(1) = x^m_i(2) = 0. The \ufb01rst\ntwo states encode the original Max-Cut instance, while the third state is used to construct a labeling\ny^m that has value equal to K \u2212 1, and is otherwise not used.\nLet K\u2217 be the value of the maximum cut in G. If in any assignment to the last example there is a\nvariable taking the state 3 and another variable taking the state 1 or 2, then the assignment\u2019s value\nwill be at most K\u2217 \u2212 n^2, which is less than zero. By construction, the all-3 assignment has value K \u2212 1.\nThus, the optimal assignment must either be all-3 with value K \u2212 1, or some combination of states 1\nand 2, which has value at most K\u2217. If K\u2217 > K \u2212 1 then the all-3 assignment is not optimal and the examples are not\nseparable. If K\u2217 \u2264 K \u2212 1, the examples are separable.\nThis result illustrates the potential dif\ufb01culty of learning in worst-case graphs. Nonetheless, many\nproblems have a more restricted dependence on the input. For example, in computer vision, edge\npotentials may depend only on the difference in color between two adjacent pixels. Our results do\nnot preclude positive results of learnability in such restricted settings. By establishing hardness of\nlearning, we also close the open problem of relating hardness of inference and learning in structured\nprediction. If inference problems can be solved in polynomial time, then so can learning (using, e.g.,\nstructured perceptron). 
Thus, when learning is hard, inference must be hard as well.\n\n4 Experiments\n\nTo evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets.\nWe show that, as the number of training samples grows, the accuracy of the pseudo-max method improves\nand its speed-up gain over competing algorithms increases. Our learning algorithm corresponds\nto solving the following, where we add L2 regularization and use a scaled 0-1 loss,\ne(y_i, y^m_i) = 1{y_i \u2260 y^m_i}/n_m (n_m is the number of labels in example m):\n\nmin_\u03b8 (C / \u2211_m n_m) \u2211_{m=1}^{M} \u2211_{i=1}^{n_m} max_{y_i} [f(y^m_{\u2212i}, y_i; x^m, \u03b8) \u2212 f(y^m; x^m, \u03b8) + e(y_i, y^m_i)] + \u2016\u03b8\u2016^2.    (10)\n\nWe will compare the pseudo-max method with learning using structural SVMs, both with exact\ninference and LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time.\n\n(a) Synthetic    (b) Reuters\n\nFigure 2: Test error as a function of train size for various algorithms. Sub\ufb01gure (a) shows results for a synthetic\nsetting, while (b) shows performance on the Reuters data.\n\nIn the synthetic setting we use the discriminant function f(y; x, \u03b8) = \u2211_{ij\u2208E} \u03b8_ij(y_i, y_j) +\n\u2211_i x_i \u03b8_i(y_i), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels.\nFor a weight vector \u03b8\u2217 (sampled once, uniformly in the range [\u22121, 1], and used for all train/test\nsets) we generate train and test instances by sampling x^m uniformly in the range [\u22125, 5] and then\ncomputing the optimal labels y^m = arg max_{y\u2208Y} f(y; x^m, \u03b8\u2217).\nWe generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms,\nand measure the test error for the learned weights (with 1000 test samples). 
For each train\nsize we average the test error over 10 repeats of sampling and training. Fig. 2(a) shows a comparison\nof the test error for the three learning algorithms. For small numbers of training examples, the test\nerror of pseudo-max is larger than that of the other algorithms. However, as the train size grows, the\nerror converges to that of exact learning, as our consistency results predict.\nWe also test the performance of our algorithm on a multi-label document classi\ufb01cation task from the\nReuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of\nthe dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph\nstructure (see [4] for a similar setting). We use random subsamples of increasing size from the train\nset to learn the parameters, and then measure the test error using 20000 additional samples. For each\nsample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training\ndata as a hold-out set. Fig. 2(b) shows that for the large data regime the performance of pseudo-max\nlearning gets close to that of the other methods. However, unlike the synthetic setting there is still a\nsmall gap, even after seeing the entire train set. 
This could be because the full dataset is not yet large\nenough to be in the consistent regime (note that exact learning has not \ufb02attened either), or because\nthe consistency conditions are not fully satis\ufb01ed: the data might be non-separable or the support of\nthe input distribution p(x) may be partial.\nWe next apply our method to the problem of learning the energy function for protein side-chain\nplacement, mirroring the learning setup of [14], where the authors train a conditional random \ufb01eld\n(CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.5 The\nprediction problem for side-chain placement corresponds to \ufb01nding the most likely assignment in\na pairwise MRF, and \ufb01ts naturally into our learning framework. There are only 8 parameters to\nbe learned, corresponding to a reweighting of known energy terms. The dataset consists of 275\nproteins, where each MRF has several hundred variables (one per residue of the protein) and each\nvariable has on average 20 states. For prediction we use CPLEX\u2019s ILP solver.\nFig. 3 shows a comparison of the pseudo-max method and a cutting-plane algorithm which uses an\nLP relaxation, solved with CPLEX, for \ufb01nding violated constraints.6 We generate training sets of\nincreasing size (M = {10, 50, 100, 274}), and measure the test error for the learned weights on the\nremaining examples.7 For M = 10, 50, 100 we average the test error over 3 random train/test splits,\nwhereas for M = 274 we do 1-fold cross validation. We use C = 1 for both algorithms.\n\n5The authors\u2019 data and results are available from: http://cyanover.fhcrc.org/recomb-2007/\n6We signi\ufb01cantly optimized the cutting-plane algorithm, e.g. 
including a large number of initial cutting-planes\nand restricting the weight vector to be positive (which we know to hold at optimality).\n\n\u2077Specifically, for each protein we compute the fraction of correctly predicted \u03c7\u2081 and \u03c7\u2082 angles for all\nresidues (except when trivial, e.g. just 1 state). Then, we compute the median of this value across all proteins.\n\n[Figure 2 plots, panels (a) and (b): test error vs. train size for exact, LP-relaxation, and pseudo-max learning.]\n\nFigure 3: Training time (for one train/test split) and test error as a function of train size for both the pseudo-max\nmethod and a cutting-plane algorithm which uses an LP relaxation for inference, applied to the problem of\nlearning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy\nthan both the LP relaxation and HCRF (given roughly five times more data) for a fraction of the training time.\n\nThe original weights (\u201cSoft rep\u201d [3]) used for this energy function have 26.7% error across all 275\nproteins. The best previously reported parameters, learned in [14] using a Hidden CRF, obtain\n25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate).\nTo get a sense of the difficulty of this learning task, we also tried a random positive weight vector,\nuniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse\nif we allowed the weights to be negative). Training using pseudo-max with 50 examples, we learn\nparameters in under a minute that give better accuracy than the HCRF. The speed-up of training with\npseudo-max (using CPLEX\u2019s QP solver) versus cutting-plane is striking. For example, for M = 10,\npseudo-max takes only 3 seconds, a 1000-fold speedup. 
Unfortunately, the cutting-plane algorithm\ntook prohibitively long to run on the larger training sets. Since the data used\nin learning for protein side-chain placement is both highly non-separable and relatively scarce, these\npositive results illustrate the potential widespread applicability of the pseudo-max method.\n\n5 Discussion\n\nThe key idea of our method is to find parameters that prefer the true assignment ym over assignments\nthat differ from it in only one variable, rather than over all other assignments. Perhaps surprisingly, this\nweak requirement is sufficient to achieve consistency given a rich enough input distribution. One\nextension of our approach is to add constraints for assignments that differ from ym in more than one\nvariable. This would tighten the outer bound on \u0398 and possibly result in improved performance, but\nwould also increase computational complexity. We could also add such competing assignments via\na cutting-plane scheme so that optimization is performed only over a subset of these constraints.\nOur work raises a number of important open problems. It would be interesting to derive generalization\nbounds to understand the convergence rate of our method, and to understand the effect of\nthe distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the\none hand, it needs to explore the space Y in the sense that a sufficient number of labels need to be\nobtained as the correct label for the true parameters (this is indeed used in our consistency proofs).\nOn the other hand, p(x) needs to be sufficiently sensitive close to the decision boundaries so that\nthe true parameters can be inferred. We expect that generalization analysis will depend on these two\nproperties of p(x). 
Note that [11] studied active learning schemes for structured data and may be\nrelevant in the current context.\nHow should one apply this learning algorithm to non-separable data sets? We suggested one approach,\nbased on using a hinge loss for each of the pseudo constraints. One question in this context\nis, how resilient is this learning algorithm to label noise? Recent work has analyzed the sensitivity\nof pseudo-likelihood methods to model mis-specification [8], and it would be interesting to perform\na similar analysis here. Also, is it possible to give any guarantees for the empirical and expected\nrisks (with respect to exact inference) obtained by outer bound learning versus exact learning?\nFinally, our algorithm demonstrates a phenomenon where more data can make computation easier.\nSuch a scenario was recently analyzed in the context of supervised learning [12], and it would be\ninteresting to combine the approaches.\nAcknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported\nby BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship.\n\n[Figure 3 plots: test error (\u03c7\u2081 and \u03c7\u2082) vs. train size for pseudo-max, LP-relaxation, and Soft rep; time to train (minutes) vs. train size for pseudo-max and LP-relaxation.]\n\nReferences\n\n[1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179\u2013195, 1975.\n\n[2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.\n\n[3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209\u20131221, 2007.\n\n[4] T. Finley and T. 
Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pages 304\u2013311. ACM, 2008.\n\n[5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27\u201359, 2009.\n\n[6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785\u2013792. 2008.\n\n[7] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. JMLR, 5:361\u2013397, 2004.\n\n[8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th International Conference on Machine Learning, pages 584\u2013591, New York, NY, USA, 2008. ACM Press.\n\n[9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML 26, pages 713\u2013720, 2009.\n\n[10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.\n\n[11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proc. of the European Conference on Machine Learning (ECML). Springer, September 2006.\n\n[12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928\u2013935. ACM, 2008.\n\n[13] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25\u201332. 2004.\n\n[14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. 
Journal of Computational Biology, 15(7):899\u2013911, 2008.", "award": [], "sourceid": 809, "authors": [{"given_name": "David", "family_name": "Sontag", "institution": null}, {"given_name": "Ofer", "family_name": "Meshi", "institution": null}, {"given_name": "Amir", "family_name": "Globerson", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}