{"title": "Deep Structured Prediction with Nonlinear Output Transformations", "book": "Advances in Neural Information Processing Systems", "page_first": 6320, "page_last": 6331, "abstract": "Deep structured models are widely used for tasks like semantic segmentation, where explicit correlations between variables provide important prior information which generally helps to reduce the data needs of deep nets. However, current deep structured models are restricted by oftentimes very local neighborhood structure, which cannot be increased for computational complexity reasons, and by the fact that the output configuration, or a representation thereof, cannot be transformed further. Very recent approaches which address those issues include graphical model inference inside deep nets so as to permit subsequent non-linear output space transformations. However, optimization of those formulations is challenging and not well understood. Here, we develop a novel model which generalizes existing approaches, such as structured prediction energy networks, and discuss a formulation which maintains applicability of existing inference techniques.", "full_text": "Deep Structured Prediction with Nonlinear\n\nOutput Transformations\n\nColin Graber\n\nOfer Meshi\u2020\n\nAlexander Schwing\n\ncgraber2@illinois.edu\n\nmeshi@google.com\n\naschwing@illinois.edu\n\nUniversity of Illinois at Urbana-Champaign\n\n\u2020Google\n\nAbstract\n\nDeep structured models are widely used for tasks like semantic segmentation,\nwhere explicit correlations between variables provide important prior information\nwhich generally helps to reduce the data needs of deep nets. However, current deep\nstructured models are restricted by oftentimes very local neighborhood structure,\nwhich cannot be increased for computational complexity reasons, and by the fact\nthat the output con\ufb01guration, or a representation thereof, cannot be transformed\nfurther. 
Very recent approaches which address those issues include graphical\nmodel inference inside deep nets so as to permit subsequent non-linear output\nspace transformations. However, optimization of those formulations is challenging\nand not well understood. Here, we develop a novel model which generalizes\nexisting approaches, such as structured prediction energy networks, and discuss a\nformulation which maintains applicability of existing inference techniques.\n\n1\n\nIntroduction\n\nNowadays, machine learning models are used widely across disciplines from computer vision and\nnatural language processing to computational biology and physical sciences. This wide usage\nis fueled, particularly in recent years, by easily accessible software packages and computational\nresources, large datasets, a problem formulation which is general enough to capture many cases of\ninterest, and, importantly, trainable high-capacity models, i.e., deep nets.\nWhile deep nets are a very convenient tool these days, enabling rapid progress in both industry and\nacademia, their training is known to require signi\ufb01cant amounts of data. One possible reason is the\nfact that prior information on the structural properties of output variables is not modeled explicitly.\nFor instance, in semantic segmentation, neighboring pixels are semantically similar, or in disparity\nmap estimation, neighboring pixels often have similar depth. The hope is that if such structural\nassumptions hold true in the data, then learning becomes easier (e.g., smaller sample complexity)\n[10]. To address a similar shortcoming of linear models, in the early 2000\u2019s, structured models were\nproposed to augment support vector machines (SVMs) and logistic regression. 
Those structured\nmodels are commonly referred to as \u2018Structured SVMs\u2019 [52, 54] and \u2018conditional random \ufb01elds\u2019 [27]\nrespectively.\nMore recently, structured models have also been combined with deep nets, \ufb01rst in a two-step training\nsetup where the deep net is trained before being combined with a structured model, e.g., [1, 8], and\nthen by considering a joint formulation [53, 59, 9, 42]. In these cases, structured prediction is used\non top of a deep net, using simple models for the interactions between output variables, such as plain\nsummation. This formulation may be limiting in the type of interactions it can capture. To address\nthis shortcoming, very recently, efforts have been conducted to include structured prediction inside,\ni.e., not on top of, a deep net. For instance, structured prediction energy networks (SPENs) [3, 4]\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(a) Deep nets\n\n(b) Structured deep nets\n\n(c) SPENs\n\n(d) Non-linear structured\ndeep nets\n\nFigure 1: Comparison between different model\ntypes.\n\nFigure 2: A diagram of the proposed nonlin-\near structured deep network model. Each im-\nage is transformed via a 2-layer MLP (H) into\na 26-dimensional feature representation. Struc-\ntured inference uses this representation to pro-\nvide a feature vector y which is subsequently\ntransformed by another 2-layer MLP (T ) to pro-\nduce the \ufb01nal model score.\n\nwere proposed to reduce the excessively strict inductive bias that is assumed when computing a score\nvector with one entry per output space con\ufb01guration. Different from the aforementioned classical\ntechniques, SPENs compute independent prediction scores for each individual component of the\noutput as well as a global score which is obtained by passing a complete output prediction through\na deep net. 
Unlike prior approaches, SPENs do not allow for the explicit speci\ufb01cation of output\nstructure, and structural constraints are not maintained during inference.\nIn this work, we represent output variables as an intermediate structured layer in the middle of the\nneural architecture. This gives the model the power to capture complex nonlinear interactions between\noutput variables, which prior deep structured methods do not capture. Simultaneously, structural\nconstraints are enforced during inference, which is not the case with SPENs. We provide two intuitive\ninterpretations for including structured prediction inside a deep net rather than at its output. First,\nthis formulation allows one to explicitly model local output structure while simultaneously assessing\nglobal output coherence in an implicit manner. This increases the expressivity of the model without\nincurring the cost of including higher-order potentials within the explicit structure. A second view\ninterprets learning of the network above the structured \u2018output\u2019 as training of a loss function which is\nsuitable for the considered task.\nIncluding structure inside deep nets isn\u2019t trivial. For example, it is reported that SPENs are hard to\noptimize [4]. To address this issue, here, we discuss a rigorous formulation for structure inside deep\nnets. Different from SPENs which apply a continuous relaxation to the output space, here, we use a\nLagrangian framework. One advantage of the resulting objective is that any classical technique for\noptimization over the structured space can be readily applied.\nWe demonstrate the effectiveness of our proposed approach on real-world applications, including\nOCR, image tagging, multilabel classi\ufb01cation and semantic segmentation. 
In each case, the proposed approach is able to improve task performance over deep structured baselines.\n\n2 Related Work\n\nWe briefly review related work and contrast existing approaches to our formulation.\n\nStructured Prediction: Interest in structured prediction sparked from the seminal works of Lafferty et al. [27], Taskar et al. [52], Tsochantaridis et al. [54] and has continued to grow in recent years. These techniques were originally formulated to augment linear classifiers such as SVMs or logistic regression with a model for correlations between multiple variables of interest. Although the prediction problem (i.e., inference) in such models is NP-hard in general [47], early work on structured prediction focused on special cases where the inference task was tractable. Later work addressed cases where inference was intractable and focused on designing efficient formulations and algorithms.\nExisting structured prediction formulations define a score, which is often assumed to consist of multiple local functions, i.e., functions which depend on small subsets of the variables. The parameters of the score function are learned from a given dataset by encouraging that the score for a ground-truth configuration is higher than that of any other configuration. Several works studied the learning problem when inference is hard [13, 26, 40, 17, 33, 23, 43, 35], designing effective approximations.\n\nDeep Potentials: After impressive results were demonstrated by Krizhevsky et al. [25] on the ImageNet dataset [12], deep nets gained a significant amount of attention. Alvarez et al. [1], Chen et al. [8], Song et al. 
[48] took advantage of accurate local classi\ufb01cation for tasks such as semantic\nsegmentation by combining deep nets with graphical models in a two step procedure. More speci\ufb01-\ncally, deep nets were \ufb01rst trained to produce local evidence (see Fig. 1a). In a second training step\nlocal evidence was \ufb01xed and correlations were learned. While leading to impressive results, a two\nstep training procedure seemed counterintuitive and Tompson et al. [53], Zheng et al. [59], Chen\u2217\net al. [9], Schwing and Urtasun [42], Lin et al. [29] proposed a uni\ufb01ed formulation (see Fig. 1b) which\nwas subsequently shown to perform well on tasks such as semantic segmentation, image tagging etc.\nOur proposed approach is different in that we combine deep potentials with another deep net that is\nable to transform the inferred output space or features thereof.\n\nAutoregressive Models: Another approach to solve structured prediction problems using deep\nnets de\ufb01nes an order over the output variables and predicts one variable at a time, conditioned on\nthe previous ones. This approach relies on the chain rule, where the conditional is modeled with a\nrecurrent deep net (RNN). It has achieved impressive results in machine translation [51, 28], computer\nvision [38], and multi-label classi\ufb01cation [36]. The success of these methods ultimately depends on\nthe ability of the neural net to model the conditional distribution, and they are often sensitive to the\norder in which variables are processed. In contrast, we use a more direct way of modeling structure\nand a more global approach to inference by predicting all variables together.\n\nSPENs: Most related to our approach is the recent work of Belanger and McCallum [3], Belanger\net al. [4], which introduced structured prediction energy networks (SPENs) with the goal to address\nthe inductive bias. 
More specifically, Belanger and McCallum [3] observed that automatically learning the structure of deep nets leads to improved results. To optimize the resulting objective, a relaxation of the discrete output variables to the unit interval was applied and stochastic gradient descent or entropic mirror descent were used. Similar in spirit but more practically oriented is work by Nguyen et al. [37]. These approaches are illustrated in Fig. 1c. Despite additional improvements [4], optimization of the proposed approach remains challenging due to the non-convexity, which may cause the output space variables to get stuck in local optima.\n'Deep value networks,' proposed by Gygli et al. [15], are another gradient based approach which uses the same architecture and relaxation as SPENs. However, the training objective is inspired by value based reinforcement learning.\nOur proposed method differs in two ways. First, we maintain the possibility to explicitly encourage structure inside deep nets. Hence our approach extends SPENs by including additional modeling capabilities. Second, instead of using a continuous relaxation of the output space variables, we formulate inference via a Lagrangian. Due to this formulation we can apply any of the existing inference mechanisms, from belief propagation [39] and all its variants [31, 16] to LP relaxations [57]. More importantly, this also allows us (1) to naturally handle problems that are more general than multi-label classification; and (2) to use standard structured loss functions, rather than having to extend them to continuous variables, as SPENs do.\n\n3 Model Description\n\nFormally, let c denote input data that is available for conditioning, for example sentences, images, video or volumetric data. Let x = (x_1, . . . , x_K) ∈ X = ∏_{k=1}^K X_k denote the multi-variate output space, with x_k ∈ X_k, k ∈ {1, . . . , K}, indicating a single variable defined on the domain X_k = {1, . . . , |X_k|}, assumed to be discrete. Generally, inference amounts to finding the configuration x* = argmax_{x∈X} F(x, c, w) which maximizes a score F(x, c, w) that depends on the condition c, the configuration x and some model parameters w.\nClassical deep nets assume variables to be independent of each other (given the context). Hence, the score decomposes into a sum of local functions F(x, c, w) = Σ_{k=1}^K f_k(x_k, c, w), each depending only on a single x_k. Due to this decomposition, inference is easily possible by optimizing each f_k(x_k, c, w) w.r.t. x_k independently of the other ones. Such a model is illustrated in Fig. 1a. It is however immediately apparent that this approach doesn't explicitly take correlations between any pair of variables into account.\nTo model such context more globally, the score F(x, c, w) = Σ_{r∈R} f_r(x_r, c, w) is composed of overlapping local functions f_r(x_r, c, w) that are no longer restricted to depend on only a single variable x_k. Rather, f_r may depend on an arbitrary subset of variables x_r = (x_k)_{k∈r} with r ⊆ {1, . . . , K}. The set R subsumes all subsets r ∈ R that are required to describe the score for the considered task. Finding the highest scoring configuration x* for this type of function generally requires global inference, which is NP-hard [47]. It is common to resort to well-studied approximations [57], unless exact techniques such as dynamic programming or submodular optimization [41, 30, 50, 21] are applicable. The complexity of those approximations increases with the size of the largest variable index subset r. Therefore, many of the models considered to date do not exceed pairwise interactions. This is shown in Fig. 
1b.\n\nBeyond this restriction to low-order locality, the score function F(x, c, w) = Σ_{r∈R} f_r(x_r, c, w) being expressed as a sum is itself a limitation. It is this latter restriction which we address directly in this work. However, we emphasize that the employed non-linear output space transformations are able to extract non-local high-order correlations implicitly, hence we address locality indirectly.\nTo alleviate the restriction of the score function being a sum, and to implicitly enable high-order interactions while modeling structure, our framework extends the aforementioned score via a non-linear transformation of its output, formally,\n\nF(x, c, w) = T(c, H(x, c, w), w).    (1)\n\nThis is illustrated as a general concept in Fig. 1d and with the specific model used by our experiments in Fig. 2. We use T to denote the additional (top) non-linear output transformation. Parameters w may or may not be shared between bottom and top layers, i.e., we view w as a long vector containing all trainable model weights. Different from structured deep nets, where F is required to be real-valued, H may be vector-valued. In this work, H is a vector where each entry represents the score f_r(x_r, c, w) for a given region r and assignment to that region x_r, i.e., the vector H has Σ_{r∈R} |X_r| entries; however, other forms are possible. It is immediately apparent that T = 1ᵀH yields the classical score function F(x, c, w) = Σ_{r∈R} f_r(x_r, c, w), and other more complex and in particular deep net based transformations T are directly applicable.\nFurther note that for deep net based transformations T, x is no longer part of the outer-most function, making the proposed approach more general than existing methods. Particularly, the 'output space' configuration x is obtained inside a deep net, consisting of the bottom part H and the top part T. 
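To make Eq. (1) concrete, the following is a minimal NumPy sketch of the construction under assumed toy dimensions (a two-variable chain with three labels, random potentials, and an arbitrary 2-layer network standing in for T); none of these numbers come from the paper.

```python
import numpy as np

# Toy instance of Eq. (1), F(x, c, w) = T(c, H(x, c, w), w), for a chain
# x1 - x2 over 3 labels. All potentials and weights are random placeholders.
rng = np.random.default_rng(0)
L = 3
f1 = rng.normal(size=L)            # unary region f_{1}(x_1)
f2 = rng.normal(size=L)            # unary region f_{2}(x_2)
f12 = rng.normal(size=(L, L))      # pair region f_{1,2}(x_1, x_2)

def H(x):
    # One entry per (region, assignment) pair, sum_r |X_r| entries total;
    # the entry for (r, x_r) holds f_r(x_r) and all other entries are zero.
    x1, x2 = x
    h = np.zeros(L + L + L * L)
    h[x1] = f1[x1]
    h[L + x2] = f2[x2]
    h[2 * L + x1 * L + x2] = f12[x1, x2]
    return h

def T_linear(y):
    # T = 1^T y recovers the classical score sum_r f_r(x_r, c, w).
    return y.sum()

def T_mlp(y, W1, W2):
    # A small nonlinear top network: the score no longer decomposes over
    # regions, which is the extra modeling power discussed in the text.
    return float(W2 @ np.tanh(W1 @ y))

x = (1, 2)
assert np.isclose(T_linear(H(x)), f1[1] + f2[2] + f12[1, 2])

W1 = rng.normal(size=(4, L + L + L * L))
W2 = rng.normal(size=4)
print(T_mlp(H(x), W1, W2))
```

With the linear top the score reduces to the familiar sum of potentials; any nonlinear T couples all regions at once, which is exactly the generalization described above.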
This can be viewed as a structure-layer, a natural way to represent meaningful features in the intermediate nodes of the network. Also note that SPENs [3] can be viewed as a special case of our approach (ignoring optimization). Specifically, we obtain the SPEN formulation when H consists of purely local scoring functions, i.e., when H_k = f_k(x_k, c, w). This is illustrated in Fig. 1c.\nGenerality has implications on inference and learning. Specifically, inference, i.e., solving the program\n\nx* = argmax_{x∈X} T(c, H(x, c, w), w),    (2)\n\ninvolves back-propagation through the non-linear output transformation T. Note that back-propagation through T encodes top-down information into the inferred configuration, while forward propagation through H provides a classical bottom-up signal. Because of the top-down information we say that global structure is implicitly modeled. Alternatively, T can be thought of as an adjustable loss function which matches predicted scores H to data c.\nUnlike previous structured models, the scoring function presented in Eq. (1) does not decompose across the regions in R. 
As a result, inference techniques developed previously for structured models do not apply directly here, and new techniques must be developed. To optimize the program given in Eq. (2), for continuous variables x, gradient descent via back-propagation is applicable. In the absence of any other strategy, for discrete x, SPENs apply a continuous relaxation where constraints restrict the domain. However, no guarantees are available for this form of optimization, even if maximization over the output space and back-propagation are tractable computationally. Additionally, projection into X is nontrivial here due to the additional structured constraints. To obtain consistency with existing structured deep net formulations and to maintain applicability of classical inference methods such as dynamic programming and LP relaxations, in the following, we discuss an alternative formulation for both inference and learning. (Footnote 1: 1 denotes the all-ones vector.)\n\nAlgorithm 1 Inference Procedure\n1: Input: Learning rates α_y, α_λ; y_0; λ_0; number of iterations n\n2: µ* ⇐ argmin_µ̂ H^D(µ̂, c, λ, w)\n3: λ̄ ⇐ λ_0\n4: y_1 ⇐ y_0\n5: for i = 1 to n do\n6:    repeat\n7:       y_i ⇐ y_i − (1/α_y)((y_i − y_{i−1} + α_y λ̄) − ∇_y T(c, y, w))\n8:    until convergence\n9:    λ_i ⇐ λ_{i−1} − α_λ (∇_λ H^D(µ*, c, λ, w) − y_i)\n10:   λ̄ ⇐ 2λ_i − λ_{i−1}; y_{i+1} ⇐ y_i\n11: end for\n12: λ ⇐ (2/n) Σ_{i=n/2}^n λ_i; y ⇐ (2/n) Σ_{i=n/2}^n y_i\n13: µ ⇐ argmin_µ̂ H^D(µ̂, c, λ̄, w)\n14: Return: µ, λ, y\n\n3.1 Inference\n\nWe next describe a technique to optimize structured deep nets augmented by non-linear output space transformations. 
This method is compelling because existing frameworks for graphical models can be deployed. Importantly, optimization over computationally tractable output spaces remains computationally tractable in this formulation. To achieve this goal, a dual-decomposition based Lagrangian technique is used to split the objective into two interacting parts, optimized with an alternating strategy. The resulting inference program is similar in spirit to inference problems derived in other contexts using similar techniques (see, for example, [24]). Formally, note that the inference task considered in Eq. (2) is equivalent to the following constrained program:\n\nmax_{x∈X, y} T(c, y, w)  s.t.  y = H(x, c, w),    (3)\n\nwhere the variable y may be a vector of scores. By introducing Lagrange multipliers λ, the proposed objective is reformulated into the following saddle-point problem:\n\nmin_λ ( max_y {T(c, y, w) − λᵀy} + max_{x∈X} λᵀH(x, c, w) ).    (4)\n\nTwo advantages of the resulting program are immediately apparent. Firstly, the objective for the maximization over the output space X, required for the second term in parentheses, decomposes linearly across the regions in R. As a result, this subproblem can be tackled with classical techniques, such as dynamic programming, message passing, etc., for which a great amount of literature is readily available [e.g., 58, 6, 56, 14, 49, 24, 16, 43, 44, 45, 22, 34, 32]. Secondly, maximization over the output space X is connected to back-propagation only via Lagrange multipliers. Therefore, back-propagation methods can run independently. Here, we optimize over X by following Hazan et al. [18], Chen et al. [9], using a message passing formulation based on an LP relaxation of the original program.\nSolving inference requires finding the saddle point of Eq. (4) over λ, y, and x. 
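The alternating structure of Eq. (4) can be sketched on a toy problem. Everything below is an illustrative assumption: a two-variable chain, a hand-picked concave T so the y-step has a closed form, and brute-force search standing in for message passing over X.

```python
import numpy as np
from itertools import product

# Toy version of the saddle point in Eq. (4):
#   min_lam [ max_y {T(y) - lam^T y} + max_{x in X} lam^T H(x) ]
rng = np.random.default_rng(1)
L = 3                          # labels per variable
dim = L + L + L * L            # entries of H: two unary regions + one pair

f1 = rng.normal(size=L)
f2 = rng.normal(size=L)
f12 = rng.normal(size=(L, L))

def H(x1, x2):
    # Region scores placed at the entries selected by (x1, x2).
    h = np.zeros(dim)
    h[x1] = f1[x1]
    h[L + x2] = f2[x2]
    h[2 * L + x1 * L + x2] = f12[x1, x2]
    return h

a = rng.normal(size=dim)
def T(y):                      # concave toy "top network" so the y-step is exact
    return a @ y - 0.5 * y @ y

lam = np.zeros(dim)
alpha = 0.1
for _ in range(200):
    # (i) y-step: argmax_y T(y) - lam^T y has the closed form y = a - lam.
    y = a - lam
    # (ii) x-step: argmax_x lam^T H(x) decomposes over regions, so classical
    # discrete solvers apply; brute force suffices at this scale.
    x_star = max(product(range(L), repeat=2), key=lambda x: lam @ H(*x))
    # (iii) lam-step: a subgradient of the outer minimization is H(x*) - y.
    lam -= alpha * (H(*x_star) - y)

print("constraint gap ||H(x*) - y||:", np.linalg.norm(H(*x_star) - y))
```

The three steps mirror the text: a continuous y-update driven by T, a purely discrete x-update that only sees λ, and a multiplier update coupling the two.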
However, the fact that maximization with respect to the output space is a discrete optimization problem complicates this process somewhat. To simplify this, we follow the derivation in [18, 9] by dualizing the LP relaxation problem to convert maximization over X into a minimization over dual variables µ. This allows us to rewrite inference as follows:\n\nmin_λ min_µ ( max_y {T(c, y, w) − λᵀy} + H^D(µ, c, λ, w) ),    (5)\n\nwhere H^D(µ, c, λ, w) is the relaxed dual objective of the original discrete optimization problem. The algorithm is summarized in Alg. 1. See Section 3.3 for discussion of the approach taken to optimize the saddle point.\n\nAlgorithm 2 Weight Update Procedure\n1: Input: Learning rate α, ŷ, λ̂, and D\n2: for i = 1 to n do\n3:    g ⇐ 0\n4:    for every datapoint in a minibatch do\n5:       x̂ ⇐ Inference in Algorithm 1 (adding L(x, x̂))\n6:       g ⇐ g + ∇_w (T(c, H(x̂, c, w), w) − T(c, H(x, c, w), w))\n7:    end for\n8:    w ⇐ w − α (Cw + g)\n9: end for\n\nFor arbitrary region decompositions R and potential transformations T, inference can only be guaranteed to converge to local optima of the optimization problem. There do exist choices for R and T, however, where global convergence guarantees can be attained – specifically, if R forms a tree [39] and if T is concave in y (which can be attained, for example, using an input-convex neural network [2]). We leave exploration of the impact of local versus global inference convergence on model performance for future work. 
For now, we note that the experimental results presented in Section 4 imply that inference converges sufficiently well in practice for this model to make better predictions than the baselines.\n\n3.2 Learning\n\nWe formulate the learning task using the common framework for structured support vector machines [52, 54]. Given an arbitrary scoring function F, we find optimal weights w by maximizing the margin between the score assigned to the ground-truth configuration and the highest-scoring incorrect configuration:\n\nmin_w Σ_{(x,c)∈D} ( max_{x̂∈X} {F(x̂, c, w) + L(x, x̂)} − F(x, c, w) ),    (6)\n\nwhere the inner maximization is referred to as loss augmented inference. This formulation applies to any scoring function F, and we can therefore substitute in the program given in Eq. (1) to arrive at the final learning objective:\n\nmin_w (C/2)||w||_2^2 + Σ_{(x,c)∈D} ( max_{x̂∈X} {T(c, H(x̂, c, w), w) + L(x, x̂)} − T(c, H(x, c, w), w) ).    (7)\n\nTo solve loss augmented inference we follow the dual-decomposition based derivation discussed in Sec. 3.1. In short, we replace loss augmented inference with the program obtained in Eq. (5) by adding the loss term L(x, x̂). This requires the loss to decompose according to R, which is satisfied by many standard losses (e.g., the Hamming loss). Note that beyond extending SPEN, the proposed learning formulation is less restrictive than SPEN since we don't assume the loss L(x, x̂) to be differentiable w.r.t. x.\nWe optimize the program given in Eq. (7) by alternating between inference to update λ, y, and µ and taking gradient steps in w. 
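For a small chain, the loss-augmented objective of Eq. (7) can be evaluated by brute force. The sizes, the Hamming loss, and the purely additive score standing in for T are assumptions made for illustration, not the paper's trained networks.

```python
import numpy as np
from itertools import product

# Brute-force structured hinge loss (Eq. (7), without the regularizer)
# for a tiny chain: 3 output variables, 4 labels each.
rng = np.random.default_rng(2)
K, L = 3, 4
unary = rng.normal(size=(K, L))           # f_k(x_k)
pair = rng.normal(size=(K - 1, L, L))     # f_{k,k+1}(x_k, x_{k+1})

def score(x):
    # Additive score standing in for T(c, H(x, c, w), w).
    s = sum(unary[k, x[k]] for k in range(K))
    s += sum(pair[k, x[k], x[k + 1]] for k in range(K - 1))
    return s

def hamming(x, x_hat):
    # Decomposes over regions, as required by the derivation above.
    return sum(a != b for a, b in zip(x, x_hat))

x_gt = (0, 1, 2)                          # ground-truth configuration

# Loss-augmented inference: maximize score + Hamming over all configurations.
x_hat = max(product(range(L), repeat=K),
            key=lambda x: score(x) + hamming(x_gt, x))
hinge = score(x_hat) + hamming(x_gt, x_hat) - score(x_gt)

# The hinge term is nonnegative, since x_hat = x_gt already achieves 0.
assert hinge >= 0
print(hinge)
```

In the paper this inner maximization is handled by the dual-decomposition machinery of Sec. 3.1 rather than enumeration; the brute-force version only makes the max-margin construction explicit.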
Note that the specified formulation has the additional benefit of allowing optimization w.r.t. µ to be interleaved with optimization w.r.t. the model parameters w, though we leave this exploration to future work. Since inference and learning are based on a saddle-point formulation, specific attention has to be paid to ensure convergence to the desired values. We discuss those details subsequently.\n\nFigure 3: Sample datapoints for experiments. (a) Word recognition datapoints. (b) Segmentation datapoints.\n\nTable 1: Results for word recognition experiments. The two numbers per entry represent the word and character accuracies, respectively.\n\n             | Chain Train    | Chain Test     | Second-order Train | Second-order Test\nUnary        | 0.003 / 0.2946 | 0.000 / 0.2350 | --                 | --\nDeepStruct   | 0.077 / 0.4548 | 0.040 / 0.3460 | 0.084 / 0.4528     | 0.030 / 0.3220\nLinearTop    | 0.137 / 0.5308 | 0.085 / 0.4030 | 0.164 / 0.5386     | 0.090 / 0.4090\nNLTop        | 0.156 / 0.5464 | 0.075 / 0.4150 | 0.242 / 0.5828     | 0.140 / 0.4420\n\n3.3 Implementation Details\n\nWe use the primal-dual algorithm from [7] to solve the saddle point problem in Eq. (5). Though an averaging scheme is not specified, we observe better convergence in practice by averaging over the last n/2 iterates of y and λ. The overall inference procedure is outlined in Alg. 1.\nFor learning, we select a minibatch of data at every iteration. For all samples in the minibatch we first perform loss augmented inference following Alg. 1 modified by adding the loss. Every round of inference is followed by an update of the weights of the model, which is accomplished via gradient descent. This process is summarized in Alg. 2. 
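A scalar caricature of this primal-dual scheme (alternating gradient steps, an extrapolated multiplier, and averaging of the last n/2 iterates) illustrates why tail averaging helps. The toy saddle-point problem below, min_λ max_y −0.5y² + λ(y − 1) with its saddle at y = λ = 1, is an assumption made purely for illustration, not the paper's objective.

```python
import numpy as np

# Primal-dual updates on a scalar saddle problem: ascend in y, descend in
# lam, extrapolate lam_bar = 2*lam - lam_prev, then average the tail.
alpha_y, alpha_lam, n = 0.5, 0.5, 200
y, lam, lam_bar = 0.0, 0.0, 0.0
ys, lams = [], []
for i in range(n):
    y = y + alpha_y * (lam_bar - y)        # gradient ascent on -0.5 y^2 + lam_bar * y
    lam_prev = lam
    lam = lam - alpha_lam * (y - 1.0)      # gradient descent on lam * (y - 1)
    lam_bar = 2.0 * lam - lam_prev         # extrapolation step
    ys.append(y)
    lams.append(lam)

# Average the last n/2 iterates rather than taking the final point.
y_avg = np.mean(ys[n // 2:])
lam_avg = np.mean(lams[n // 2:])
print(y_avg, lam_avg)   # both approach the saddle point at 1.0
```

Saddle-point iterations tend to oscillate around the solution rather than settle on it; averaging the tail of the trajectory damps that oscillation, which matches the practical observation reported above.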
Note that the current gradient depends on the model\nestimates for \u02c6x, \u03bb, and y.\nWe implemented this non-linear structured deep net model using the PyTorch framework.2 Our\nimplementation allows for the usage of arbitrary higher-order graphical models, and it allows for an\narbitrary composition of graphical models within the H vector speci\ufb01ed previously. The message\npassing implementation used to optimize over the discrete space X is in C++ and integrated into the\nmodel code as a python extension.\n\n4 Experiments\n\nWe evaluate our non-linear structured deep net model on several diverse tasks: word recognition,\nimage tagging, multilabel classi\ufb01cation, and semantic segmentation. For these tasks, we trained\nmodels using some or all of the following con\ufb01gurations: Unary consists of a deep network model\ncontaining only unary potentials. DeepStruct consists of a deep structured model [9]; unless\notherwise speci\ufb01ed, these were trained by \ufb01xing the pretrained Unary potentials and learning\npairwise potentials. For all experiments, these potentials have the form fi,j(xi, xj, W ) = Wxi,xj ,\nwhere Wxi,xj is the (xi, xj)-th element of the weight matrix W and i,j are a pair of nodes in the\ngraph. For the word recognition and segmentation experiments, the pairwise potentials are shared\nacross every pair, and in the others, unique potentials are learned for every pair. These unary and\npairwise potentials are then \ufb01xed, and a \u201cTop\u201d model is trained using them; LinearTop consists\nof a structured deep net model with linear T , i.e. T (c, y, w) = wT y, while NLTop consists of a\nstructured deep net model where the form of T is task-speci\ufb01c. 
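A sketch of the pairwise potentials just described, f_{i,j}(x_i, x_j, W) = W_{x_i,x_j}, with illustrative sizes (26 letter classes, 5-letter words) and a random table standing in for learned weights.

```python
import numpy as np

# Pairwise potential as a learned table indexed by the two labels:
# f_{i,j}(x_i, x_j, W) = W[x_i, x_j]. Here W is shared across all pairs,
# as in the word recognition and segmentation setups described above.
num_classes = 26
rng = np.random.default_rng(3)
W_shared = rng.normal(size=(num_classes, num_classes))

def pairwise_score(x, edges, W):
    """Sum W[x_i, x_j] over the edges of the graph."""
    return sum(W[x[i], x[j]] for (i, j) in edges)

# Chain edges over a 5-letter word; the second-order graph additionally
# connects letters two positions apart (see the word recognition setup).
chain_edges = [(i, i + 1) for i in range(4)]
second_order_edges = chain_edges + [(i, i + 2) for i in range(3)]

x = [2, 11, 14, 18, 4]   # letter indices of 'close' (a=0, b=1, ...)
print(pairwise_score(x, chain_edges, W_shared))
print(pairwise_score(x, second_order_edges, W_shared))
```

For the tagging experiments the analogous construction uses a separate table per edge of the fully connected label graph instead of one shared table.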
For all experiments, additional details are discussed in Appendix A.1, including specific architectural details and hyperparameter settings.\n\nWord Recognition: Our first set of experiments were run on a synthetic word recognition dataset. The dataset was constructed by taking a list of 50 common five-letter English words, e.g., 'close,' 'other,' and 'world,' and rendering each letter as a 28x28 pixel image. This was done by selecting a random image of each letter from the Chars74K dataset [11], randomly rotating, shifting, and scaling them, and then inserting them into random background patches with high intensity variance. The task is then to identify each word from the five letter images. The training, validation, and test sets for these experiments consist of 1,000, 200, and 200 words, respectively, generated in this way. See Fig. 3a for sample words from this dataset. (Footnote 2: Code available at: https://github.com/cgraber/NLStruct.)\n\nTable 2: Results for image tagging experiments. All values are hamming losses.\n\n             | Train | Validation | Test\nUnary        | 1.670 | 2.176      | 2.209\nDeepStruct   | 1.135 | 2.045      | 2.045\nDeepStruct++ | 1.139 | 2.003      | 2.057\nSPENInf      | 1.121 | 2.016      | 2.061\nNLTop        | 1.111 | 1.976      | 2.038\n\nTable 3: Results for segmentation experiments. All values are mean intersection-over-union.\n\n           | Train  | Validation | Test\nUnary      | 0.8005 | 0.7266     | 0.7100\nDeepStruct | 0.8216 | 0.7334     | 0.7219\nSPENInf    | 0.8585 | 0.7542     | 0.7525\nNLTop      | 0.8542 | 0.7552     | 0.7522\nOracle     | 0.9260 | 0.8792     | 0.8633\n\nHere, Unary consists of a two-layer perceptron trained using a max-margin loss on the individual letter images as a 26-way letter classifier. Both LinearTop and NLTop models were trained for this task, the latter of which consist of 2-layer sigmoidal multilayer perceptrons. For all structured models, two different graphs were used: each contains five nodes, one per letter in each word. 
The\n\ufb01rst contains four pair edges connecting each adjacent letter, and the second additionally contains\nsecond-order edges connecting letters to letters two positions away. Both graph con\ufb01gurations of the\nLinearTop and NLTop models \ufb01nished 400 epochs of training in approximately 2 hours.\nThe word and character accuracy results for these experiments are presented in Tab. 1. We observe\nthat, for both graph types, adding structure improves model performance. Additionally, including\na global potential transformation increases performance further, and this improvement is increased\nwhen the transformation is nonlinear.\n\nMultilabel Classi\ufb01cation: For this set of experiments, we compare against SPENs on the Bibtex\nand Bookmarks datasets used by Belanger and McCallum [3] and Tu and Gimpel [55]. These datasets\nconsist of binary feature vectors, each of which is assigned some subset of 159/208 possible labels,\nrespectively. 500/1000 pairs were chosen for the structured models for Bibtex and Bookmarks,\nrespectively, by selecting the labels appearing most frequently together within the training data.\nOur Unary model obtained macro-averaged F1 scores of 44.0 and 38.4 on the Bibtex and Bookmarks\ndatasets, respectively; DeepStruct and NLStruct performed comparably. Note that these classi\ufb01er\nscores outperform the SPEN results reported in Tu and Gimpel [55] of 42.4 and 34.4, respectively.\n\nImage Tagging: Next, we train image tagging models using the MIRFLICKR25k dataset [20]. It\nconsists of 25,000 images taken from Flickr, each of which are assigned some subset of a possible 24\ntags. The train/development/test sets for these experiments consisted of 10,000/5,000/10,000 images,\nrespectively.\nHere, the Unary classi\ufb01er consists of AlexNet [25], \ufb01rst pre-trained on ImageNet and then \ufb01ne-\ntuned on the MIRFLICKR25k data. For DeepStruct, both the unary and pairwise potentials were\ntrained jointly. 
A fully connected pairwise graphical model was used, with one binary node per label and an edge connecting every pair of labels. Training of the NLStruct model was completed in approximately 9.2 hours.
The results for this set of experiments are presented in Tab. 2. We observe that adding explicit structure improves a non-structured model and that adding implicit structure through T improves an explicitly structured model. We additionally compare against a SPEN-like inference procedure (SPENInf) as follows: we load the trained NLTop model and find the optimal output structure maxx∈X T(c, H(c, x, w), w) by relaxing x to be in [0, 1]^24 and using gradient ascent (the final output is obtained by rounding). We observe that using this inference procedure provides inferior results to our approach.

To verify that the improved results for NLTop are not the result of an increased number of parameters, we additionally trained another DeepStruct model containing more parameters, which is called DeepStruct++ in Table 2. Specifically, we fixed the original DeepStruct potentials and learned two additional 2-layer multilayer perceptrons that further transformed the unary and pairwise potentials. Note that this model adds approximately 1.8 times more parameters than NLTop (i.e., 2,444,544 vs. 1,329,408) but performs worse. NLTop can capture global structure that may be present in the data during inference, whereas DeepStruct only captures local structure during inference.

Semantic Segmentation: Finally, we run foreground-background segmentation on the Weizmann Horses database [5], consisting of 328 images of horses paired with segmentation masks (see Fig. 3b for example images).
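For concreteness, the SPEN-like inference procedure (SPENInf) compared against above — relax the discrete labeling to the box [0, 1]^L, run gradient ascent on the score, then round — can be sketched as follows. This is a simplified stand-in in plain Python: the quadratic toy score and its known maximizer are assumptions replacing the learned networks T and H.

```python
def spen_like_inference(grad_fn, num_labels, steps=100, lr=0.1):
    """Projected gradient ascent over a relaxed labeling x in [0,1]^L,
    followed by rounding to {0,1}^L -- a simplified stand-in for the
    SPEN-like maximization of a learned score (not the paper's code)."""
    x = [0.5] * num_labels                         # relaxed initialization
    for _ in range(steps):
        g = grad_fn(x)                             # gradient of the score at x
        x = [min(1.0, max(0.0, xi + lr * gi))      # ascent step, then project
             for xi, gi in zip(x, g)]              # back onto the box [0,1]^L
    return [int(xi >= 0.5) for xi in x]            # round to a discrete tagging

# Illustrative concave score -||x - t||^2 whose maximizer is a known
# tagging t over L = 24 binary labels (an assumption for this sketch):
target = [1, 0, 1] + [0] * 21
grad = lambda x: [-2.0 * (xi - ti) for xi, ti in zip(x, target)]
recovered = spen_like_inference(grad, num_labels=24)  # recovers t
```

With a learned, non-concave score the same loop applies, but gradient ascent may stop at a poor local optimum — one plausible reason this baseline underperforms the structured inference used in the paper.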
We use train/validation/test splits of 196/66/66 images, respectively. Additionally, we scale the input images such that the smaller dimension is 224 pixels long and take a center crop of 224x224 pixels; the same is done for the masks, except using a length of 64 pixels. The Unary classifier is similar to FCN-AlexNet from [46], while NLStruct consists of a convolutional architecture built from residual blocks [19]. We additionally train a model with an architecture similar to NLStruct where the ground-truth labels are included as an input into T (Oracle). Here, the NLStruct model required approximately 10 hours to complete 160 training epochs.
Tab. 3 displays the results for this experiment. Once again, we observe that adding the potential transformation T improves task performance. The far superior performance of the Oracle model validates our approach, as it suggests that our model formulation has the capacity to take a fixed set of potentials and rebalance them in a way that performs better than using those potentials alone. We also evaluate the model using the same SPEN-like inference procedure as described in the Image Tagging experiment (SPENInf). In this case, both approaches performed comparably.

5 Conclusion and Future Work

In this work we developed a framework for deep structured models which allows for implicit modeling of higher-order structure as an intermediate layer in the deep net. We showed that our approach generalizes existing models such as structured prediction energy networks. We also discussed an optimization framework which retains applicability of existing inference engines such as dynamic programming or LP relaxations. Our approach was shown to improve performance on a variety of tasks over a base set of potentials.
Moving forward, we will continue to develop this framework by investigating other possible architectures for the top network T and by investigating other methods of solving inference.
Additionally, we hope to assess this framework's applicability on other tasks. In particular, the tasks chosen for experimentation here contained fixed-size output structures; however, it is common for the outputs of structured prediction tasks to be of variable size. This requires different architectures for T than the ones considered here.

Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, Samsung, 3M, and the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR). We thank NVIDIA for providing the GPUs used for this research.

References

[1] J. Alvarez, Y. LeCun, T. Gevers, and A. Lopez. Semantic road segmentation via multi-scale ensembles of learned features. In Proc. ECCV, 2012.
[2] B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. In Proc. ICML, 2017.
[3] D. Belanger and A. McCallum. Structured Prediction Energy Networks. In Proc. ICML, 2016.
[4] D. Belanger, B. Yang, and A. McCallum. End-to-end learning for structured prediction energy networks. In Proc. ICML, 2017.
[5] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proc. ECCV, 2002.
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast Approximate Energy Minimization via Graph Cuts. PAMI, 2001.
[7] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 2011.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proc. ICLR, 2015.
[9] L.-C. Chen*, A. G. Schwing*, A. L. Yuille, and R. Urtasun. Learning Deep Structured Models. In Proc. ICML, 2015. *equal contribution.
[10] C. Ciliberto, F. Bach, and A. Rudi. Localized structured prediction.
arXiv preprint arXiv:1806.02402, 2018.
[11] T. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. 2009.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. CVPR, 2009.
[13] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proc. ICML, 2008.
[14] A. Globerson and T. Jaakkola. Approximate Inference Using Planar Graph Decomposition. In Proc. NIPS, 2006.
[15] M. Gygli, M. Norouzi, and A. Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proc. ICML, 2017.
[16] T. Hazan and A. Shashua. Norm-Product Belief Propagation: Primal-Dual Message-Passing for LP-Relaxation and Approximate Inference. Trans. Information Theory, 2010.
[17] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.
[18] T. Hazan, A. G. Schwing, and R. Urtasun. Blending Learning and Inference in Conditional Random Fields. JMLR, 2016.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proc. CVPR, 2016.
[20] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In Proc. ACM International Conference on Multimedia Information Retrieval, 2008.
[21] S. Jegelka, H. Lin, and J. Bilmes. On Fast Approximate Submodular Minimization. In Proc. NIPS, 2011.
[22] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. IJCV, 2015.
[23] N. Komodakis. Efficient training for pairwise or higher order CRFs via dual decomposition. In Proc. CVPR, 2011.
[24] N. Komodakis and N. Paragios.
Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Proc. CVPR, 2009.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.
[26] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Proc. NIPS, 2008.
[27] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.
[28] R. Leblond, J.-B. Alayrac, A. Osokin, and S. Lacoste-Julien. SEARNN: Training RNNs with global-local losses. In Proc. ICLR, 2018.
[29] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Deeply learning the messages in message passing inference. In Proc. NIPS, 2015.
[30] S. T. McCormick. Submodular Function Minimization, pages 321–391. Elsevier, 2008.
[31] T. Meltzer, A. Globerson, and Y. Weiss. Convergent Message Passing Algorithms: A Unifying View. In Proc. UAI, 2009.
[32] O. Meshi and A. Schwing. Asynchronous parallel coordinate minimization for MAP inference. In Proc. NIPS, 2017.
[33] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning Efficiently with Approximate Inference via Dual Losses. In Proc. ICML, 2010.
[34] O. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and Strong: MAP Inference with Linear Convergence. In Proc. NIPS, 2015.
[35] O. Meshi, M. Mahdavi, A. Weller, and D. Sontag. Train and test tightness of LP relaxations in structured prediction. In Proc. ICML, 2016.
[36] J. Nam, E. Loza Mencía, H. J. Kim, and J. Fürnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Proc. NIPS, 2017.
[37] K. Nguyen, C. Fookes, and S. Sridharan. Deep Context Modeling for Semantic Segmentation. In Proc. WACV, 2017.
[38] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu.
Pixel recurrent neural networks. In Proc. ICML, 2016.
[39] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. In Proc. AAAI, 1982.
[40] P. Pletscher, C. S. Ong, and J. M. Buhmann. Entropy and Margin Maximization for Structured Output Learning. In Proc. ECML PKDD, 2010.
[41] A. Schrijver. Combinatorial Optimization. Springer, 2004.
[42] A. G. Schwing and R. Urtasun. Fully Connected Deep Structured Networks. https://arxiv.org/abs/1503.02351, 2015.
[43] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
[44] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Dual MAP LP Relaxation Solvers Using Fenchel-Young Margins. In Proc. NIPS, 2012.
[45] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Parallel MAP LP Relaxation Solver Using the Frank-Wolfe Algorithm. In Proc. ICML, 2014.
[46] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 2016.
[47] S. E. Shimony. Finding MAPs for Belief Networks is NP-hard. Artificial Intelligence, 1994.
[48] Y. Song, A. G. Schwing, R. Zemel, and R. Urtasun. Training Deep Neural Networks via Direct Loss Minimization. In Proc. ICML, 2016.
[49] D. Sontag, T. Meltzer, A. Globerson, and T. Jaakkola. Tightening LP Relaxations for MAP Using Message Passing. In Proc. NIPS, 2008.
[50] P. Stobbe and A. Krause. Efficient Minimization of Decomposable Submodular Functions. In Proc. NIPS, 2010.
[51] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Proc. NIPS, 2014.
[52] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Proc. NIPS, 2003.
[53] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation.
In Proc. NIPS, 2014.
[54] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 2005.
[55] L. Tu and K. Gimpel. Learning approximate inference networks for structured prediction. In Proc. ICLR, 2018.
[56] M. J. Wainwright and M. I. Jordan. Variational Inference in Graphical Models: The View from the Marginal Polytope. In Proc. Conf. on Control, Communication and Computing, 2003.
[57] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families and Variational Inference. Foundations and Trends in Machine Learning, 2008.
[58] T. Werner. A Linear Programming Approach to Max-Sum Problem: A Review. PAMI, 2007.
[59] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In Proc. ICCV, 2015.