{"title": "Structured Learning via Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 647, "page_last": 655, "abstract": "A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is \u201csmoothed\u201d through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an \u201coracle\u201d exists to minimize a logistic loss.", "full_text": "Structured Learning via Logistic Regression\n\nNICTA and The Australian National University\n\nJustin Domke\n\njustin.domke@nicta.com.au\n\nAbstract\n\nA successful approach to structured learning is to write the learning objective as\na joint function of linear parameters and inference messages, and iterate between\nupdates to each. This paper observes that if the inference problem is \u201csmoothed\u201d\nthrough the addition of entropy terms, for \ufb01xed messages, the learning objective\nreduces to a traditional (non-structured) logistic regression problem with respect\nto parameters. In these logistic regression problems, each training example has a\nbias term determined by the current set of messages. Based on this insight, the\nstructured energy function can be extended from linear factors to any function\nclass where an \u201coracle\u201d exists to minimize a logistic loss.\n\n1 Introduction\n\nThe structured learning problem is to \ufb01nd a function F (x, y) to map from inputs x to outputs as\ny\u2217 = arg maxy F (x, y). 
F is chosen to optimize a loss function defined on these outputs. A major challenge is that evaluating the loss for a given function F requires solving the inference optimization to find the highest-scoring output y for each exemplar, which is NP-hard in general. A standard solution to this is to write the loss function using an LP-relaxation of the inference problem, meaning an upper-bound on the true loss. The learning problem can then be phrased as a joint optimization of parameters and inference variables, which can be solved, e.g., by alternating message-passing updates to inference variables with gradient descent updates to parameters [16, 9].

Previous work has mostly focused on linear energy functions F(x, y) = wᵀΦ(x, y), where a vector of weights w is adjusted in learning, and Φ(x, y) = Σα Φα(x, yα) decomposes over subsets of variables yα. While linear weights are often useful in practice [23, 16, 9, 3, 17, 12, 5], it is also common to make use of non-linear classifiers. This is typically done by training a classifier (e.g. ensembles of trees [20, 8, 25, 13, 24, 18, 19] or multi-layer perceptrons [10, 21]) to predict each variable independently. Linear edge interaction weights are then learned, with unary classifiers either held fixed [20, 8, 25, 13, 24, 10] or used essentially as "features" with linear weights re-adjusted [18].

This paper allows the more general form F(x, y) = Σα fα(x, yα). The learning problem is to select fα from some set of functions Fα. Here, following previous work [15], we add entropy smoothing to the LP-relaxation of the inference problem. Again, this leads to phrasing the learning problem as a joint optimization of learning and inference variables, alternating between message-passing updates to inference variables and optimization of the functions fα. 
The major result is that minimization of the loss over fα ∈ Fα can be re-formulated as a logistic regression problem, with a "bias" vector added to each example reflecting the current messages incoming to factor α. No assumptions are needed on the sets of functions Fα, beyond assuming that an algorithm exists to optimize the logistic loss on a given dataset over all fα ∈ Fα.

We experimentally test the results of varying Fα to be the set of linear functions, multi-layer perceptrons, or boosted decision trees. Results verify the benefits of training flexible function classes in terms of joint prediction accuracy.

2 Structured Prediction

The structured prediction problem can be written as seeking a function h that will predict an output y from an input x. Most commonly, it can be written in the form

h(x; w) = arg maxy wᵀΦ(x, y),   (1)

where Φ is a fixed function of both x and y. The maximum takes place over all configurations of the discrete vector y. It is further assumed that Φ decomposes into a sum of functions evaluated over subsets of variables yα as

Φ(x, y) = Σα Φα(x, yα).

The learning problem is to adjust the set of linear weights w. This paper considers the structured learning problem in a more general setting, directly handling nonlinear function classes. We generalize the function h to

h(x; F) = arg maxy F(x, y),

where the energy F again decomposes as

F(x, y) = Σα fα(x, yα).

The learning problem now becomes to select {fα ∈ Fα} for some set of functions Fα. This reduces to the previous case when fα(x, yα) = wᵀΦα(x, yα) is a linear function. 
Here, we do not make any assumption on the class of functions Fα other than assuming that there exists an algorithm to find the best function fα ∈ Fα in terms of the logistic regression loss (Section 6).

3 Loss Functions

Given a dataset (x1, y1), ..., (xN, yN), we wish to select the energy F to minimize the empirical risk

R(F) = Σk l(xk, yk; F),   (2)

for some loss function l. Absent computational concerns, a standard choice would be the slack-rescaled loss [22]

l0(xk, yk; F) = maxy F(xk, y) − F(xk, yk) + Δ(yk, y),   (3)

where Δ(yk, y) is some measure of discrepancy. We assume that Δ is a function that decomposes over α, i.e. that Δ(yk, y) = Σα Δα(ykα, yα). Our experiments use the Hamming distance.

In Eq. 3, the maximum ranges over all possible discrete labelings y, which is NP-hard in general. If this inference problem must be solved approximately, there is strong motivation [6] for using relaxations of the maximization in Eq. 1, since this yields an upper-bound on the loss. A common solution [16, 14, 6] is to use a linear relaxation¹

l1(xk, yk; F) = maxµ∈M F(xk, µ) − F(xk, yk) + Δ(yk, µ),   (4)

where the local polytope M is defined as the set of local pseudomarginals that are normalized, and agree when marginalized over other neighboring regions,

M = {µ | µαβ(yβ) = µβ(yβ) ∀β ⊂ α,  Σyα µα(yα) = 1 ∀α,  µα(yα) ≥ 0 ∀α, yα}.

Here, µαβ(yβ) = Σyα\β µα(yα) is µα marginalized out over some region β contained in α. 
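To see why the relaxation can be strictly loose (and hence why l1 merely upper-bounds l0), consider the textbook example of a frustrated cycle. The sketch below, with made-up scores that are not taken from the paper, exhibits a fractional point of M that outscores every integral labeling.

```python
import itertools
import numpy as np

# Standard illustration (not from the paper): a frustrated 3-cycle with binary
# labels and pairwise scores theta_ij(yi, yj) = 1 if yi != yj, else 0.
edges = [(0, 1), (1, 2), (2, 0)]
theta = np.array([[0.0, 1.0], [1.0, 0.0]])  # theta[yi, yj]

# Integral maximum: enumerate all 2^3 labelings; a cycle can cut at most 2 edges.
int_max = max(sum(theta[y[i], y[j]] for i, j in edges)
              for y in itertools.product([0, 1], repeat=3))

# Fractional point of the local polytope M: uniform singleton pseudomarginals,
# and pairwise pseudomarginals putting all mass on disagreement.
mu_pair = np.array([[0.0, 0.5], [0.5, 0.0]])
assert np.allclose(mu_pair.sum(axis=0), [0.5, 0.5])  # consistent with singletons
assert np.allclose(mu_pair.sum(axis=1), [0.5, 0.5])

frac_score = len(edges) * (mu_pair * theta).sum()

assert int_max == 2.0 and frac_score == 3.0  # relaxed value strictly exceeds integral
```

The fractional point satisfies every normalization and marginalization constraint of M, yet scores 3 where the best labeling scores 2.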
It is easy to show that l1 ≥ l0, since the two would be equivalent if µ were restricted to binary values, and hence the maximization in l1 takes place over a larger set [6]. We also define

θkF(yα) = fα(xk, yα) + Δα(ykα, yα),   (5)

¹Here, F and Δ are slightly generalized to allow arguments of pseudomarginals, as F(xk, µ) = Σα Σyα fα(xk, yα)µ(yα) and Δ(yk, µ) = Σα Σyα Δα(ykα, yα)µ(yα).

which gives the equivalent representation of l1 as l1(xk, yk; F) = −F(xk, yk) + maxµ∈M θkF · µ. The maximization in l1 is of a linear objective under linear constraints, and is thus a linear program (LP), solvable in polynomial time using a generic LP solver. In practice, however, it is preferable to use custom solvers based on message-passing that exploit the sparsity of the problem.

Here, we make a further approximation to the loss, replacing the inference problem of maxµ∈M θ · µ with the "smoothed" problem maxµ∈M θ · µ + ε Σα H(µα), where H(µα) is the entropy of the marginals µα. This approximation has been considered by Meshi et al. [15], who show that local message-passing can have a guaranteed convergence rate, and by Hazan and Urtasun [9], who use it for learning. The relaxed loss is

l(xk, yk; F) = −F(xk, yk) + maxµ∈M [θkF · µ + ε Σα H(µα)].   (6)

Since the entropy is positive, this is clearly a further upper-bound on the "unsmoothed" loss, i.e. l1 ≤ l. Moreover, we can bound the looseness of this approximation as in the following theorem, proved in the appendix. 
A similar result was previously given [15] bounding the difference of the objective obtained by inference with and without entropy smoothing.

Theorem 1. l and l1 are bounded as (where |yα| is the number of configurations of yα)

l1(x, y, F) ≤ l(x, y, F) ≤ l1(x, y, F) + εHmax,   Hmax = Σα log |yα|.

4 Overview

Now, the learning problem is to select the functions fα composing F to minimize R as defined in Eq. 2. The major challenge is that evaluating R(F) requires performing inference. Specifically, if we define

A(θ) = maxµ∈M θ · µ + ε Σα H(µα),   (7)

then we have that

minF R(F) = minF Σk [−F(xk, yk) + A(θkF)].

Since A(θ) contains a maximization, this is a saddle-point problem. Inspired by previous work [16, 9], our solution (Section 5) is to introduce a vector of "messages" λ to write A in the dual form

A(θ) = minλ A(λ, θ),

which leads to phrasing learning as the joint minimization

minF min{λk} Σk [−F(xk, yk) + A(λk, θkF)].

We propose to solve this through an alternating optimization of F and {λk}. For fixed F, message-passing can be used to perform coordinate descent updates to all the messages λk (Section 5). These updates are trivially parallelized with respect to k. However, the problem remains, for fixed messages, how to optimize the functions fα composing F. Section 7 observes that this problem can be re-formulated into a (non-structured) logistic regression problem, with "bias" terms added to each example that reflect the current messages into factor α.

5 Inference

In order to evaluate the loss, it is necessary to solve the maximization in Eq. 6. 
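For a single factor, the smoothed maximization has the closed form ε log Σyα exp(θ(yα)/ε) (this is Lemma 3 in Section 7), which makes the sandwich of Theorem 1 easy to check numerically. A minimal sketch with arbitrary made-up scores:

```python
import numpy as np

# One factor with n configurations and arbitrary (made-up) scores theta.
eps = 0.1
theta = np.array([1.3, -0.2, 0.7, 1.25])

lp_value = theta.max()                              # unsmoothed max over the simplex
smoothed = eps * np.log(np.exp(theta / eps).sum())  # max of theta.mu + eps*H(mu)

# Theorem 1's sandwich for this factor: LP value <= smoothed <= LP value + eps*log|y|.
assert lp_value <= smoothed <= lp_value + eps * np.log(theta.size)
```

The smoothed value here is about 1.348, between the LP value 1.3 and the bound 1.3 + 0.1·log 4 ≈ 1.439; as ε → 0 the smoothed value collapses onto the LP value.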
For a given θ, consider doing inference over µ, that is, solving the maximization in Eq. 7. Standard Lagrangian duality theory gives the following dual representation for A(θ) in terms of "messages" λα(yβ) from a region α to a subregion β ⊂ α, a variant of the representation of Heskes [11].

Algorithm 1 Reducing structured learning to logistic regression.

For all k, α, initialize λk(yα) ← 0.
Repeat until convergence:
1. For all k, for all α, set the bias term to
   bkα(yα) ← (1/ε) [Δ(ykα, yα) + Σβ⊂α λkα(yβ) − Σγ⊃α λkγ(yα)].
2. For all α, solve the logistic regression problem
   fα ← arg maxfα∈Fα Σk [ (fα(xk, ykα) + bkα(ykα)) − log Σyα exp(fα(xk, yα) + bkα(yα)) ].
3. For all k, for all α, form updated parameters as
   θk(yα) ← εfα(xk, yα) + Δ(ykα, yα).
4. For all k, perform a fixed number of message-passing iterations to update λk using θk (Eq. 10).

Theorem 2. 
A(θ) can be represented in the dual form A(θ) = minλ A(λ, θ), where

A(λ, θ) = maxµ∈N θ · µ + ε Σα H(µα) + Σα Σβ⊂α Σyβ λα(yβ) (µαβ(yβ) − µβ(yβ)),   (8)

and N = {µ | Σyα µα(yα) = 1, µα(yα) ≥ 0} is the set of locally normalized pseudomarginals. Moreover, for a fixed λ, the maximizing µ is given by

µα(yα) = (1/Zα) exp( (1/ε) [θ(yα) + Σβ⊂α λα(yβ) − Σγ⊃α λγ(yα)] ),   (9)

where Zα is a normalizing constant to ensure that Σyα µα(yα) = 1.

Thus, for any set of messages λ, there is an easily-evaluated upper-bound A(λ, θ) ≥ A(θ), and when A(λ, θ) is minimized with respect to λ, this bound is tight. The standard approach to performing the minimization over λ is essentially block-coordinate descent. There are variants, depending on the size of the "block" that is updated. In our experiments, we use blocks consisting of the set of all messages λα(yν) for all regions α containing ν. When the graph only contains regions for single variables and pairs, this is a "star update" of all the messages from pairs that contain a variable i. It can be shown [11, 15] that the update is

λ′α(yν) ← λα(yν) + (ε / (1 + Nν)) (log µν(yν) + Σα′⊃ν log µα′(yν)) − ε log µα(yν),   (10)

for all α ⊃ ν, where Nν = |{α | α ⊃ ν}|. Meshi et al. 
[15] show that with greedy or randomized selection of blocks to update, O(1/δ) iterations are sufficient to converge within error δ.

6 Logistic Regression

Logistic regression is traditionally understood as defining a conditional distribution p(y|x; W) = exp((Wx)y) / Z(x), where W is a matrix that maps the input features x to a vector of margins Wx. It is easy to show that the maximum conditional likelihood training problem maxW Σk log p(yk|xk; W) is equivalent to

maxW Σk [ (Wxk)yk − log Σy exp(Wxk)y ].

Here, we generalize this in two ways. First, rather than taking the mapping from features x to the margin for label y as the y-th component of Wx, we take it as f(x, y) for some function f in a set of functions F. (This reduces to the linear case when f(x, y) = (Wx)y.) Secondly, we assume that there is a pre-determined "bias" vector bk associated with each training example. This yields the learning problem

maxf∈F Σk [ (f(xk, yk) + bk(yk)) − log Σy exp(f(xk, y) + bk(y)) ].   (11)

Aside from linear logistic regression, one can see decision trees, multi-layer perceptrons, and boosted ensembles under an appropriate loss as solving Eq. 11 for different sets of functions F (albeit possibly to a local maximum).

7 Training

Recall that the learning problem is to select the functions fα ∈ Fα so as to minimize the empirical risk R(F) = Σk [−F(xk, yk) + A(θkF)]. At first blush, this appears challenging, since evaluating A(θ) requires solving a message-passing optimization. However, we can use the dual representation of A from Theorem 2 to represent minF R(F) in the form

minF min{λk} Σk [−F(xk, yk) + A(λk, θkF)].   (12)

To optimize Eq. 12, we alternate between optimization of messages {λk} and energy functions {fα}. 
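The inference half of this alternation can be sketched on a toy instance: one pair of binary variables with singleton regions, where Eq. 9 gives the pseudomarginals, A(λ, θ) is a sum of per-region smoothed maxima, and the star update of Eq. 10 (here Nν = 1) monotonically decreases the dual. All sizes and scores below are made up for illustration.

```python
import numpy as np

# Toy instance: two binary variables, one pair region.
eps = 0.1
rng = np.random.default_rng(1)
theta_i = rng.normal(size=(2, 2))   # theta_i[v] = singleton scores for variable v
theta_a = rng.normal(size=(2, 2))   # theta_a[y0, y1] = pair scores
lam = np.zeros((2, 2))              # lam[v] = message from the pair down to variable v

def reparam(lam):
    s_i = theta_i - lam                                 # subregions get -lambda (Eq. 9)
    s_a = theta_a + lam[0][:, None] + lam[1][None, :]   # the pair region gets +lambda
    return s_i, s_a

def dual(lam):                      # A(lambda, theta): sum of smoothed per-region maxima
    s_i, s_a = reparam(lam)
    return eps * (np.log(np.exp(s_i / eps).sum(axis=1)).sum()
                  + np.log(np.exp(s_a / eps).sum()))

def pseudomarginals(lam):           # Eq. 9: the maximizing mu for fixed lambda
    s_i, s_a = reparam(lam)
    mu_i = np.exp(s_i / eps); mu_i /= mu_i.sum(axis=1, keepdims=True)
    mu_a = np.exp(s_a / eps); mu_a /= mu_a.sum()
    return mu_i, mu_a

values = [dual(lam)]
for _ in range(100):
    for v in (0, 1):                # star update, Eq. 10 with N_v = 1
        mu_i, mu_a = pseudomarginals(lam)
        pair_marg = mu_a.sum(axis=1 - v)   # pair pseudomarginal marginalized onto v
        lam[v] += 0.5 * eps * (np.log(mu_i[v]) - np.log(pair_marg))
        values.append(dual(lam))

assert all(a >= b - 1e-8 for a, b in zip(values, values[1:]))  # dual never increases
mu_i, mu_a = pseudomarginals(lam)
assert np.allclose(mu_i[1], mu_a.sum(axis=0), atol=1e-6)  # just-updated block agrees
```

Each update is the exact minimizer over its block (it makes the singleton pseudomarginal and the pair's marginal proportional to their geometric mean), so the dual is nonincreasing, and at convergence the marginalization constraints of M hold.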
Optimization with respect to λk for fixed F decomposes into minimizing A(λk, θkF) independently for each k, which can be done by running message-passing updates as in Section 5 using the parameter vector θkF. Thus, the rest of this section is concerned with how to optimize with respect to F for fixed messages. Below, we will use a slight generalization of a standard result [1, p. 93].

Lemma 3. The conjugate of the entropy is the "log-sum-exp" function. Formally,

maxx: xᵀ1=1, x≥0 θ · x − ρ Σi xi log xi = ρ log Σi exp(θi/ρ).

Theorem 4. If f*α is the minimizer of Eq. 12 for fixed messages λ, then

f*α = ε arg maxfα Σk [ (fα(xk, ykα) + bkα(ykα)) − log Σyα exp(fα(xk, yα) + bkα(yα)) ],   (13)

where the set of biases are defined as

bkα(yα) = (1/ε) [Δ(ykα, yα) + Σβ⊂α λα(yβ) − Σγ⊃α λγ(yα)].   (14)

Proof. Substituting A(λ, θ) from Eq. 8 and θk from Eq. 5 gives that

A(λk, θkF) = maxµ∈N Σα Σyα (fα(xk, yα) + Δα(ykα, yα)) µα(yα) + ε Σα H(µα) + Σα Σβ⊂α Σyβ λkα(yβ) (µαβ(yβ) − µβ(yβ)).

Using the definition of bk from Eq. 14 above, this simplifies into

A(λk, θkF) = Σα maxµα∈Nα [ Σyα (fα(x, yα) + εbα(yα)) µα(yα) + εH(µα) ],

where Nα = {µα | Σyα µα(yα) = 1, µα(yα) ≥ 0} enforces that µα is a locally normalized set of marginals. Applying Lemma 3 to the inner maximization gives the closed-form expression

A(λk, θkF) = Σα ε log Σyα exp( (1/ε) fα(x, yα) + bα(yα) ).

Thus, minimizing Eq. 12 with respect to F is equivalent to finding (for all α)

arg maxfα Σk [ fα(xk, ykα) − ε log Σyα exp( (1/ε) fα(xk, yα) + bkα(yα) ) ]
= arg maxfα Σk [ (1/ε) fα(xk, ykα) − log Σyα exp( (1/ε) fα(xk, yα) + bkα(yα) ) ].

Observing that adding a bias term doesn't change the maximizing fα, and using the fact that arg max g((1/ε)·) = ε arg max g(·) gives the result.

Denoising

Fi \ Fij   Zero   Const.  Linear  Boost.  MLP
Zero       .502   .502    .511    .502    .502
Const.     .502   .502    .510    .502    .502
Linear     .444   .077    .049    .059    .034
Boost.     .444   .034    .009    .015    .007
MLP        .445   .032    .009    .015    .008

Horses

Fi \ Fij   Zero   Const.  Linear  Boost.  MLP
Zero       .246   .246    .247    .244    .245
Const.     .246   .246    .247    .244    .245
Linear     .185   .185    .168    .154    .156
Boost.     .103   .098    .092    .084    .086
MLP        .096   .094    .087    .080    .081

Table 1: Univariate Test Error Rates (Train Errors in Appendix)

The final learning algorithm is summarized as Alg. 1. Sometimes, the local classifier fα will depend on the input x only through some "local features" φα. The above framework accommodates this situation if the set Fα is considered to select these local features.

In practice, one will often wish to constrain that some of the functions fα are the same. This is done by taking the sum in Eq. 13 not just over all data k, but also over all factors α that should be so constrained. 
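The per-factor subproblem in Eq. 13 is then just multiclass logistic regression with a fixed, example-dependent bias added to the class scores. A minimal sketch for a linear function class, with made-up data standing in for the features and for the message-derived biases bkα:

```python
import numpy as np

# Made-up data: K examples, D features, C classes; b[k] stands in for the
# message-derived bias vector of Eq. 14.
rng = np.random.default_rng(2)
K, D, C = 200, 5, 3
X = rng.normal(size=(K, D))
y = rng.integers(0, C, size=K)
b = rng.normal(size=(K, C))

def objective(W):                    # the biased logistic-regression objective (Eq. 13)
    s = X @ W.T + b                  # scores f(x^k, .) + b^k(.)
    s -= s.max(axis=1, keepdims=True)           # stabilize the log-sum-exp
    return (s[np.arange(K), y] - np.log(np.exp(s).sum(axis=1))).sum()

W = np.zeros((C, D))
for _ in range(500):                 # plain gradient ascent for the linear class
    s = X @ W.T + b
    s -= s.max(axis=1, keepdims=True)
    p = np.exp(s); p /= p.sum(axis=1, keepdims=True)
    W += 0.01 * (np.eye(C)[y] - p).T @ X / K

assert objective(W) > objective(np.zeros((C, D)))
```

Any other "oracle" for this loss — a boosted ensemble or an MLP fit by SGD — could be substituted for the gradient-ascent loop without changing the surrounding algorithm.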
For example, it is common to model image segmentation problems using a 4-connected grid with an energy like F(x, y) = Σi u(φi, yi) + Σij v(φij, yi, yj), where φi/φij are univariate/pairwise features determined by x, and u and v are functions mapping local features to local energies. In this case, u would be selected to maximize Σk Σi [ (u(φki, yki) + bki(yki)) − log Σyi exp(u(φki, yi) + bki(yi)) ], and an analogous expression exists for v. This is the framework used in the following experiments.

8 Experiments

These experiments consider three different function classes: linear, boosted decision trees, and multi-layer perceptrons. To maximize Eq. 11 under linear functions f(x, y) = (Wx)y, we simply compute the gradient with respect to W and use batch L-BFGS. For a multi-layer perceptron, we fit the function f(x, y) = (Wσ(Ux))y using stochastic gradient descent with momentum² on mini-batches of size 1000, using a step size of .25 for univariate classifiers and .05 for pairwise. Boosted decision trees use stochastic gradient boosting [7]: the gradient of the logistic loss is computed for each exemplar, and a regression tree is induced to fit this (one tree for each class). To control overfitting, each leaf node must contain at least 5% of the data. Then, an optimization adjusts the values of leaf nodes to optimize the logistic loss. 
Finally, the tree values are multiplied by .25 and added to the ensemble. For reference, we also consider the "zero" classifier, and a "constant" classifier that ignores the input — equivalent to a linear classifier with a single constant feature.

²At each time, the new step is a combination of .1 times the new gradient plus .9 times the old step.

[Figure 1: The univariate (top) and pairwise (bottom) energy functions learned on denoising data. Each column shows the result of training both univariate and pairwise terms with one function class (Linear, Boosting, MLP).]

[Figure 2: Dashed/Solid lines show univariate train/test error rates as a function of learning iterations for varying univariate (rows) and pairwise (columns) classifiers.]

[Figure 3: Example Predictions on Test Images (More in Appendix)]

All examples use ε = 0.1. Each learning iteration consists of updating fi, performing 25 iterations of message passing, updating fij, and then performing another 25 iterations of message-passing.

The first dataset is a synthetic binary denoising dataset, intended for the purpose of visualization. To create an example, an image is generated with each pixel random in [0, 1]. To generate y, this image is convolved with a Gaussian with standard deviation 10 and rounded to {0, 1}. Next, if yki = 0, φki is sampled uniformly from [0, .9], while if yki = 1, φki is sampled from [.1, 1]. Finally, for a pair (i, j), if yki = ykj, then φkij is sampled from [0, .8], while if yki ≠ ykj, φkij is sampled from [.2, 1]. A constant feature is also added to both φki and φkij. There are 16 100 × 100 images each for training and testing. Test errors for each classifier combination are in Table 1, learning curves are in Fig. 2, and example results in Fig. 3. The nonlinear classifiers result in both lower asymptotic training and testing errors and faster convergence rates. Boosting converges particularly quickly. Finally, because there is only a single input feature for univariate and pairwise terms, the resulting functions are plotted in Fig. 1.

Second, as a more realistic example, we use the Weizmann horses dataset. 
We use 42 univariate features fki, consisting of a constant (1), the RGB values of the pixel (3), the vertical and horizontal position (2), and a histogram of gradients [2] (36). There are three edge features, consisting of a constant, the l2 distance of the RGB vectors for the two pixels, and the output of a Sobel edge filter. Results are shown in Table 1 and Figures 2 and 3. Again, we see benefits in using nonlinear classifiers, both in convergence rate and asymptotic error.

9 Discussion

This paper observes that in the structured learning setting, the optimization with respect to the energy can be formulated as a logistic regression problem for each factor, "biased" by the current messages. Thus, it is possible to use any function class where an "oracle" exists to optimize a logistic loss. Besides the possibility of using more general classes of energies, another advantage of the proposed method is the "software engineering" benefit of having the algorithm for fitting the energy modularized from the rest of the learning procedure. The ability to easily define new energy functions for individual problems could have practical impact.

Future work could consider convergence rates of the overall learning optimization, systematically investigate the choice of ε, or consider more general entropy approximations, such as the Bethe approximation used with loopy belief propagation.

In related work, Hazan and Urtasun [9] use a linear energy, and alternate between updating all inference variables and a gradient descent update to parameters, using an entropy-smoothed inference objective. Meshi et al. [16] also use a linear energy, with a stochastic algorithm updating inference variables and taking a stochastic gradient step on parameters for one exemplar at a time, with a pure LP-relaxation of inference. 
The proposed method iterates between updating all inference variables and performing a full optimization of the energy. This is a "batch" algorithm in the sense of making repeated passes over the data, and so is expected to be slower than an online method for large datasets. In practice, however, inference is easily parallelized over the data, and the majority of computational time is spent in the logistic regression subproblems. A stochastic solver can easily be used for these, as was done for MLPs above, giving a partially stochastic learning method.

Another related work is Gradient Tree Boosting [4], in which, to train a CRF, the functional gradient of the conditional likelihood is computed, and a regression tree is induced. This is iterated to produce an ensemble. The main limitation is the assumption that inference can be solved exactly. It appears possible to extend this to inexact inference, where the tree is induced to improve a dual bound, but this has not been done so far. Experimentally, however, simply inducing a tree on the loss gradient leads to much slower learning if the leaf nodes are not modified to optimize the logistic loss. Thus, it is likely that such a strategy would still benefit from using the logistic regression reformulation.

References

[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[3] Chaitanya Desai, Deva Ramanan, and Charless C. Fowlkes. Discriminative models for multi-class object layout. International Journal of Computer Vision, 95(1):1–12, 2011.

[4] Thomas G. Dietterich, Adam Ashenfelter, and Yaroslav Bulatov. Training conditional random fields via gradient tree boosting. In ICML, 2004.

[5] Justin Domke. Learning graphical model parameters with approximate marginal inference. 
PAMI, 35(10):2454–2467, 2013.

[6] Thomas Finley and Thorsten Joachims. Training structural SVMs when exact inference is intractable. In ICML, 2008.

[7] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367–378, 1999.

[8] Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan, and Daphne Koller. Multi-class segmentation with relative location prior. IJCV, 80(3):300–316, 2008.

[9] Tamir Hazan and Raquel Urtasun. Efficient learning of structured predictors in general graphical models. CoRR, abs/1210.2346, 2012.

[10] Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.

[11] Tom Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. J. Artif. Intell. Res. (JAIR), 26:153–190, 2006.

[12] Sanjiv Kumar and Martial Hebert. Discriminative fields for modeling spatial dependencies in natural images. In NIPS, 2003.

[13] Lubor Ladicky, Christopher Russell, Pushmeet Kohli, and Philip H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.

[14] André F. T. Martins, Noah A. Smith, and Eric P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML, 2009.

[15] Ofer Meshi, Tommi Jaakkola, and Amir Globerson. Convergence rate analysis of MAP coordinate minimization algorithms. In NIPS, 2012.

[16] Ofer Meshi, David Sontag, Tommi Jaakkola, and Amir Globerson. Learning efficiently with approximate inference via dual losses. In ICML, 2010.

[17] Sebastian Nowozin, Peter V. Gehler, and Christoph H. Lampert. On parameter learning in CRF-based approaches to object class image segmentation. In ECCV, 2010.

[18] Sebastian Nowozin, Carsten Rother, Shai Bagon, Toby Sharp, Bangpeng Yao, and Pushmeet Kohli. 
Decision tree fields. In ICCV, 2011.

[19] Florian Schroff, Antonio Criminisi, and Andrew Zisserman. Object class segmentation using random forests. In BMVC, 2008.

[20] Jamie Shotton, John M. Winn, Carsten Rother, and Antonio Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1):2–23, 2009.

[21] Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In ICCV Workshops, 2011.

[22] Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In NIPS, 2003.

[23] Jakob J. Verbeek and Bill Triggs. Scene segmentation with CRFs learned from partially labeled images. In NIPS, 2007.

[24] John M. Winn and Jamie Shotton. The layout consistent random field for recognizing and segmenting partially occluded objects. In CVPR, 2006.

[25] Jianxiong Xiao and Long Quan. Multiple view semantic segmentation for street view images. In ICCV, 2009.
", "award": [], "sourceid": 388, "authors": [{"given_name": "Justin", "family_name": "Domke", "institution": "NICTA"}]}