{"title": "A Regularized Framework for Sparse and Structured Neural Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 3338, "page_last": 3348, "abstract": "Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input.  We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max operator. We show that the gradient of this operator defines a mapping from real values to probabilities, suitable as an attention mechanism. Our framework includes softmax and a slight generalization of the recently-proposed sparsemax as special cases. However, we also show how our framework can incorporate modern structured penalties, resulting in more interpretable attention mechanisms, that focus on entire segments or groups of an input.  We derive efficient algorithms to compute the forward and backward passes of our attention mechanisms, enabling their use in a neural network trained with backpropagation.  To showcase their potential as a drop-in replacement for existing ones, we evaluate our attention mechanisms on three large-scale tasks: textual entailment, machine translation, and sentence summarization.  Our attention mechanisms improve interpretability without sacrificing performance; notably, on textual entailment and summarization, we outperform the standard attention mechanisms based on softmax and sparsemax.", "full_text": "A Regularized Framework for\n\nSparse and Structured Neural Attention\n\nVlad Niculae\u2217\nCornell University\n\nIthaca, NY\n\nvlad@cs.cornell.edu\n\nMathieu Blondel\n\nNTT Communication Science Laboratories\n\nKyoto, Japan\n\nmathieu@mblondel.org\n\nAbstract\n\nModern neural networks are often augmented with an attention mechanism, which\ntells the network where to focus within the input. We propose in this paper a\nnew framework for sparse and structured attention, building upon a smoothed\nmax operator. 
We show that the gradient of this operator de\ufb01nes a mapping from\nreal values to probabilities, suitable as an attention mechanism. Our framework\nincludes softmax and a slight generalization of the recently-proposed sparsemax as\nspecial cases. However, we also show how our framework can incorporate modern\nstructured penalties, resulting in more interpretable attention mechanisms, that\nfocus on entire segments or groups of an input. We derive ef\ufb01cient algorithms to\ncompute the forward and backward passes of our attention mechanisms, enabling\ntheir use in a neural network trained with backpropagation. To showcase their\npotential as a drop-in replacement for existing ones, we evaluate our attention\nmechanisms on three large-scale tasks: textual entailment, machine translation, and\nsentence summarization. Our attention mechanisms improve interpretability with-\nout sacri\ufb01cing performance; notably, on textual entailment and summarization, we\noutperform the standard attention mechanisms based on softmax and sparsemax.\n\n1\n\nIntroduction\n\nModern neural network architectures are commonly augmented with an attention mechanism, which\ntells the network where to look within the input in order to make the next prediction. Attention-\naugmented architectures have been successfully applied to machine translation [2, 29], speech\nrecognition [10], image caption generation [44], textual entailment [38, 31], and sentence summariza-\ntion [39], to name but a few examples. At the heart of attention mechanisms is a mapping function\nthat converts real values to probabilities, encoding the relative importance of elements in the input.\nFor the case of sequence-to-sequence prediction, at each time step of generating the output sequence,\nattention probabilities are produced, conditioned on the current state of a decoder network. 
They are\nthen used to aggregate an input representation (a variable-length list of vectors) into a single vector,\nwhich is relevant for the current time step. That vector is finally fed into the decoder network to\nproduce the next element in the output sequence. This process is repeated until the end-of-sequence\nsymbol is generated. Importantly, such architectures can be trained end-to-end using backpropagation.\n\nAlongside empirical successes, neural attention, while not necessarily correlated with human\nattention, is increasingly crucial in bringing more interpretability to neural networks by helping\nexplain how individual input elements contribute to the model\u2019s decisions. However, the most\ncommonly used attention mechanism, softmax, yields dense attention weights: all elements in the input\nalways make at least a small contribution to the decision. To overcome this limitation, sparsemax\nwas recently proposed [31], using the Euclidean projection onto the simplex as a sparse alternative to\nsoftmax. Compared to softmax, sparsemax outputs more interpretable attention weights, as illustrated\nin [31] on the task of textual entailment. The principle of parsimony, which states that simple explanations\nshould be preferred over complex ones, is not, however, limited to sparsity: it remains open\nwhether new attention mechanisms can be designed to benefit from more structural prior knowledge.\n\n\u2217Work performed during an internship at NTT Communication Science Laboratories, Kyoto, Japan.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Attention weights produced by the proposed fusedmax, compared to softmax and sparsemax,\non sentence summarization. The input sentence to be summarized (taken from [39]) is along the\nx-axis. From top to bottom, each row shows where the attention is distributed when producing\neach word in the summary. All rows sum to 1, the grey background corresponds to exactly 0 (never\nachieved by softmax), and adjacent positions with exactly equal weight are not separated by borders.\nFusedmax pays attention to contiguous segments of text with equal weight; such segments never\noccur with softmax and sparsemax. In addition to enhancing interpretability, we show in \u00a74.3 that\nfusedmax outperforms both softmax and sparsemax on this task in terms of ROUGE scores.\n\nOur contributions. The success of sparsemax motivates us to explore new attention mechanisms\nthat can both output sparse weights and take advantage of structural properties of the input through\nthe use of modern sparsity-inducing penalties. To do so, we make the following contributions:\n\n1) We propose a new general framework that builds upon a max operator, regularized with a strongly\nconvex function. We show that this operator is differentiable, and that its gradient defines a mapping\nfrom real values to probabilities, suitable as an attention mechanism. Our framework includes as\nspecial cases both softmax and a slight generalization of sparsemax. (\u00a72)\n\n2) We show how to incorporate the fused lasso [42] in this framework, to derive a new attention\nmechanism, named fusedmax, which encourages the network to pay attention to contiguous segments\nof text when making a decision. This idea is illustrated in Figure 1 on sentence summarization. For\ncases when the contiguity assumption is too strict, we show how to incorporate an OSCAR penalty\n[7] to derive a new attention mechanism, named oscarmax, that encourages the network to pay equal\nattention to possibly non-contiguous groups of words. 
(\u00a73)\n\n3) In order to use attention mechanisms defined under our framework in an autodiff toolkit, two\nproblems must be addressed: evaluating the attention itself and computing its Jacobian. However,\nour attention mechanisms require solving a convex optimization problem and do not generally\nenjoy a simple analytical expression, unlike softmax. Computing the Jacobian of the solution of\nan optimization problem is called argmin/argmax differentiation and is currently an area of active\nresearch (cf. [1] and references therein). One of our key algorithmic contributions is to show how\nto compute this Jacobian under our general framework, as well as for fused lasso and OSCAR. (\u00a73)\n\n4) To showcase the potential of our new attention mechanisms as a drop-in replacement for existing\nones, we show empirically that our new attention mechanisms enhance interpretability while achieving\ncomparable or better accuracy on three diverse and challenging tasks: textual entailment, machine\ntranslation, and sentence summarization. (\u00a74)\n\nNotation. We denote the set {1, ..., d} by [d]. We denote the (d \u2212 1)-dimensional probability\nsimplex by \u2206d := {x \u2208 Rd : \u2016x\u2016_1 = 1, x \u2265 0} and the Euclidean projection onto it by P\u2206d(x) :=\narg min_{y \u2208 \u2206d} \u2016y \u2212 x\u2016^2. Given a function f : Rd \u2192 R \u222a {\u221e}, its convex conjugate is defined by\nf\u2217(x) := sup_{y \u2208 dom f} yTx \u2212 f(y). Given a norm \u2016\u00b7\u2016, its dual is defined by \u2016x\u2016\u2217 := sup_{\u2016y\u2016 \u2264 1} yTx.\nWe denote the subdifferential of a function f at y by \u2202f(y). Elements of the subdifferential are\ncalled subgradients and when f is differentiable, \u2202f(y) contains a single element, the gradient of f\nat y, denoted by \u2207f(y). 
We denote the Jacobian of a function g : Rd \u2192 Rd at y by Jg(y) \u2208 Rd\u00d7d\nand the Hessian of a function f : Rd \u2192 R at y by Hf(y) \u2208 Rd\u00d7d.\n\n[Figure 1 panels: fusedmax, softmax, sparsemax. Input sentence: \u201crussian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism .\u201d Generated summary: \u201crussian defense minister calls for joint front against terrorism <EOS>\u201d]\n\nFigure 2: The proposed max\u2126(x) operator up to a constant (left) and the proposed \u03a0\u2126(x) mapping\n(right), illustrated with x = [t, 0] and \u03b3 = 1. In this case, max\u2126(x) is a ReLU-like function and\n\u03a0\u2126(x) is a sigmoid-like function. Our framework recovers softmax (negative entropy) and sparsemax\n(squared 2-norm) as special cases. We also introduce three new attention mechanisms: sq-pnorm-max\n(squared p-norm, here illustrated with p = 1.5), fusedmax (squared 2-norm + fused lasso), and\noscarmax (squared 2-norm + OSCAR; not pictured since it is equivalent to fusedmax in 2-d). Except\nfor softmax, which never exactly reaches 0, all mappings shown on the right encourage sparse outputs.\n\n2 Proposed regularized attention framework\n\n2.1 The max operator and its subgradient mapping\n\nTo motivate our proposal, we first show in this section that the subgradients of the maximum operator\ndefine a mapping from Rd to \u2206d, but that this mapping is highly unsuitable as an attention mechanism.\nThe maximum operator is a function from Rd to R and can be defined by\n\nmax(x) := max_{i \u2208 [d]} x_i = sup_{y \u2208 \u2206d} yTx.\n\nThe equality on the r.h.s. comes from the fact that the supremum of a linear form over the simplex\nis always achieved at one of the vertices, i.e., one of the standard basis vectors {e_i}_{i=1}^d. 
Moreover,\nit is not hard to check that any solution y\u22c6 of that supremum is precisely a subgradient of max(x):\n\u2202max(x) = {e_{i\u22c6} : i\u22c6 \u2208 arg max_{i \u2208 [d]} x_i}. We can see these subgradients as a mapping \u03a0 : Rd \u2192\n\u2206d that puts all the probability mass onto a single element: \u03a0(x) = e_i for any e_i \u2208 \u2202max(x).\nHowever, this behavior is undesirable, as the resulting mapping is a discontinuous function (a\nHeaviside step function when x = [t, 0]), which is not amenable to optimization by gradient descent.\n\n2.2 A regularized max operator and its gradient mapping\n\nThese shortcomings encourage us to consider a regularization of the maximum operator. Inspired by\nthe seminal work of Nesterov [35], we apply a smoothing technique. The conjugate of max(x) is\n\nmax\u2217(y) = 0 if y \u2208 \u2206d, and \u221e otherwise.\n\nFor a proof, see for instance [33, Appendix B]. We now add regularization to the conjugate\n\nmax\u2217\u2126(y) := \u03b3\u2126(y) if y \u2208 \u2206d, and \u221e otherwise,\n\nwhere we assume that \u2126 : Rd \u2192 R is \u03b2-strongly convex w.r.t. some norm \u2016\u00b7\u2016 and \u03b3 > 0 controls\nthe regularization strength. To define a smoothed max operator, we take the conjugate once again\n\nmax\u2126(x) = max\u2217\u2217\u2126(x) = sup_{y \u2208 Rd} yTx \u2212 max\u2217\u2126(y) = sup_{y \u2208 \u2206d} yTx \u2212 \u03b3\u2126(y).    (1)\n\nOur main proposal is a mapping \u03a0\u2126 : Rd \u2192 \u2206d, defined as the argument that achieves this supremum.\n\n\u03a0\u2126(x) := arg max_{y \u2208 \u2206d} yTx \u2212 \u03b3\u2126(y) = \u2207max\u2126(x).\n\nThe r.h.s. holds by combining that i) max\u2126(x) = (y\u22c6)Tx \u2212 max\u2217\u2126(y\u22c6) \u21d4 y\u22c6 \u2208 \u2202max\u2126(x) and ii)\n\u2202max\u2126(x) = {\u2207max\u2126(x)}, since (1) has a unique solution. 
Therefore, \u03a0\u2126 is a gradient mapping.\nWe illustrate max\u2126 and \u03a0\u2126 for various choices of \u2126 in Figure 2 (2-d) and in Appendix C.1 (3-d).\n\n[Figure 2 panels: left, max\u2126([t, 0]) + const against t; right, \u03a0\u2126([t, 0])_1 = \u2207max\u2126([t, 0])_1 against t. Curves: max, softmax, sparsemax, sq-pnorm-max, fusedmax.]\n\nImportance of strong convexity. Our \u03b2-strong convexity assumption on \u2126 plays a crucial role and\nshould not be underestimated. Recall that a function f : Rd \u2192 R is \u03b2-strongly convex w.r.t. a norm\n\u2016\u00b7\u2016 if and only if its conjugate f\u2217 is (1/\u03b2)-smooth w.r.t. the dual norm \u2016\u00b7\u2016\u2217 [46, Corollary 3.5.11]\n[22, Theorem 3]. This is sufficient to ensure that max\u2126 is (1/(\u03b3\u03b2))-smooth, or, in other words, that it is\ndifferentiable everywhere and its gradient, \u03a0\u2126, is (1/(\u03b3\u03b2))-Lipschitz continuous w.r.t. \u2016\u00b7\u2016\u2217.\n\nTraining by backpropagation. In order to use \u03a0\u2126 in a neural network trained by backpropagation,\ntwo problems must be addressed for any regularizer \u2126. The first is the forward computation: how\nto evaluate \u03a0\u2126(x), i.e., how to solve the optimization problem in (1). The second is the backward\ncomputation: how to evaluate the Jacobian of \u03a0\u2126(x), or, equivalently, the Hessian of max\u2126(x). One\nof our key contributions, presented in \u00a73, is to show how to solve these two problems for general\ndifferentiable \u2126, as well as for two structured regularizers: fused lasso and OSCAR.\n\n2.3 Recovering softmax and sparsemax as special cases\n\nBefore deriving new attention mechanisms using our framework, we now show how we can recover\nsoftmax and sparsemax, using a specific regularizer \u2126.\n\nSoftmax. We choose \u2126(y) = \u03a3_{i=1}^d y_i log y_i, the negative entropy. The conjugate of the negative\nentropy restricted to the simplex is the log sum exp [9, Example 3.25]. Moreover, if f(x) = \u03b3g(x)\nfor \u03b3 > 0, then f\u2217(y) = \u03b3g\u2217(y/\u03b3). We therefore get a closed-form expression: max\u2126(x) =\n\u03b3 log sum exp(x/\u03b3) := \u03b3 log \u03a3_{i=1}^d e^{x_i/\u03b3}. Since the negative entropy is 1-strongly convex w.r.t.\n\u2016\u00b7\u2016_1 over \u2206d, we get that max\u2126 is (1/\u03b3)-smooth w.r.t. \u2016\u00b7\u2016_\u221e. We obtain the classical softmax, with\ntemperature parameter \u03b3, by taking the gradient of max\u2126(x),\n\n\u03a0\u2126(x) = e^{x/\u03b3} / \u03a3_{i=1}^d e^{x_i/\u03b3},    (softmax)\n\nwhere e^{x/\u03b3} is evaluated element-wise. Note that some authors also call max\u2126 a \u201csoft max.\u201d Although\n\u03a0\u2126 is really a soft arg max, we opt to follow the more popular terminology. When x = [t, 0], it can\nbe checked that max\u2126(x) reduces to the softplus [16] and \u03a0\u2126(x)_1 to a sigmoid.\n\nSparsemax. We choose \u2126(y) = (1/2)\u2016y\u2016_2^2, also known as Moreau-Yosida regularization in proximal\noperator theory [35, 36]. Since (1/2)\u2016y\u2016_2^2 is 1-strongly convex w.r.t. \u2016\u00b7\u2016_2, we get that max\u2126 is (1/\u03b3)-smooth\nw.r.t. \u2016\u00b7\u2016_2. In addition, it is easy to verify that\n\n\u03a0\u2126(x) = P\u2206d(x/\u03b3) = arg min_{y \u2208 \u2206d} \u2016y \u2212 x/\u03b3\u2016^2.    (sparsemax)\n\nThis mapping was introduced as is in [31] with \u03b3 = 1 and was named sparsemax, due to the fact that\nit is a sparse alternative to softmax. Our derivation thus gives us a slight generalization, where \u03b3\ncontrols the sparsity (the smaller, the sparser) and could be tuned; in our experiments, however, we\nfollow the literature and set \u03b3 = 1. The Euclidean projection onto the simplex, P\u2206d, can be computed\nexactly [34, 15] (we discuss the complexity in Appendix B). Following [31], the Jacobian of \u03a0\u2126 is\n\nJ\u03a0\u2126(x) = (1/\u03b3) JP\u2206d(x/\u03b3) = (1/\u03b3) (diag(s) \u2212 ssT/\u2016s\u2016_1),\n\nwhere s \u2208 {0, 1}^d indicates the nonzero elements of \u03a0\u2126(x). 
Since \u03a0\u2126 is Lipschitz continuous,\nRademacher\u2019s theorem implies that \u03a0\u2126 is differentiable almost everywhere. For points where \u03a0\u2126 is\nnot differentiable (where max\u2126 is not twice differentiable), we can take an arbitrary matrix in the set\nof Clarke\u2019s generalized Jacobians [11], the convex hull of Jacobians of the form lim_{x_t \u2192 x} J\u03a0\u2126(x_t) [31].\n\n3 Deriving new sparse and structured attention mechanisms\n\n3.1 Differentiable regularizer \u2126\n\nBefore tackling more structured regularizers, we address in this section the case of general differentiable\nregularizer \u2126. Because \u03a0\u2126(x) involves maximizing (1), a concave function over the simplex,\nit can be computed globally using any off-the-shelf projected gradient solver. Therefore, the main\nchallenge is how to compute the Jacobian of \u03a0\u2126. This is what we address in the next proposition.\n\nProposition 1 Jacobian of \u03a0\u2126 for any differentiable \u2126 (backward computation)\nAssume that \u2126 is differentiable over \u2206d and that \u03a0\u2126(x) = arg max_{y \u2208 \u2206d} yTx \u2212 \u03b3\u2126(y) = y\u22c6 has\nbeen computed. Then the Jacobian of \u03a0\u2126 at x, denoted J\u03a0\u2126, can be obtained by solving the system\n\n(I + A(B \u2212 I)) J\u03a0\u2126 = A,\n\nwhere we defined the shorthands A := JP\u2206d(y\u22c6 \u2212 \u03b3\u2207\u2126(y\u22c6) + x) and B := \u03b3H\u2126(y\u22c6).\n\nThe proof is given in Appendix A.1. Unlike recent work tackling argmin differentiation through matrix\ndifferential calculus on the Karush\u2013Kuhn\u2013Tucker (KKT) conditions [1], our proof technique relies on\ndifferentiating the fixed point iteration y\u22c6 = P\u2206d(y\u22c6 \u2212 \u2207f(y\u22c6)), where f(y) = \u03b3\u2126(y) \u2212 yTx, so that\nthe argument of the projection matches that of A. To compute J\u03a0\u2126 v for an arbitrary\nv \u2208 Rd, as required by backpropagation, we may directly solve (I + A(B \u2212 I)) (J\u03a0\u2126 v) = Av. 
We\nshow in Appendix B how this system can be solved efficiently thanks to the structure of A.\n\nSquared p-norms. As a useful example of a differentiable function over the simplex, we consider\nsquared p-norms: \u2126(y) = (1/2)\u2016y\u2016_p^2 = (1/2)(\u03a3_{i=1}^d y_i^p)^{2/p}, where y \u2208 \u2206d and p \u2208 (1, 2]. For this choice\nof p, it is known that the squared p-norm is strongly convex w.r.t. \u2016\u00b7\u2016_p [3]. This implies that max\u2126 is\n(1/(\u03b3(p \u2212 1)))-smooth w.r.t. \u2016\u00b7\u2016_q, where 1/p + 1/q = 1. We call the induced mapping function sq-pnorm-max:\n\n\u03a0\u2126(x) = arg min_{y \u2208 \u2206d} (\u03b3/2)\u2016y\u2016_p^2 \u2212 yTx.    (sq-pnorm-max)\n\nThe gradient and Hessian needed for Proposition 1 can be computed by \u2207\u2126(y) = y^{p\u22121} / \u2016y\u2016_p^{p\u22122} and\n\nH\u2126(y) = diag(d) + uuT, where d = ((p \u2212 1) / \u2016y\u2016_p^{p\u22122}) y^{p\u22122} and u = \u221a((2 \u2212 p) / \u2016y\u2016_p^{2p\u22122}) y^{p\u22121},\n\nwith the exponentiation performed element-wise. sq-pnorm-max recovers sparsemax with p = 2\nand, like sparsemax, encourages sparse outputs. However, as can be seen in the zoomed box in\nFigure 2 (right), the transition between y\u22c6 = [0, 1] and y\u22c6 = [1, 0] can be smoother when 1 < p < 2.\nThroughout our experiments, we use p = 1.5.\n\n3.2 Structured regularizers: fused lasso and OSCAR\n\nFusedmax. For cases when the input is sequential and the order is meaningful, as is the case\nfor many natural languages, we propose fusedmax, an attention mechanism based on fused lasso\n[42], also known as 1-d total variation (TV). Fusedmax encourages paying attention to contiguous\nsegments, with equal weights within each one. It is expressed under our framework by choosing\n\u2126(y) = (1/2)\u2016y\u2016_2^2 + \u03bb \u03a3_{i=1}^{d\u22121} |y_{i+1} \u2212 y_i|, i.e., the sum of a strongly convex term and of a 1-d TV penalty.\nIt is easy to verify that this choice yields the mapping\n\n\u03a0\u2126(x) = arg min_{y \u2208 \u2206d} (1/2)\u2016y \u2212 x/\u03b3\u2016^2 + \u03bb \u03a3_{i=1}^{d\u22121} |y_{i+1} \u2212 y_i|.    (fusedmax)\n\nOscarmax. For situations where the contiguity assumption may be too strict, we propose oscarmax,\nbased on the OSCAR penalty [7], to encourage attention weights to merge into clusters with the\nsame value, regardless of position in the sequence. This is accomplished by replacing the 1-d\nTV penalty in fusedmax with an \u221e-norm penalty on each pair of attention weights, i.e., \u2126(y) =\n(1/2)\u2016y\u2016_2^2 + \u03bb \u03a3_{i<j} max(|y_i|, |y_j|). This results in the mapping\n\n\u03a0\u2126(x) = arg min_{y \u2208 \u2206d} (1/2)\u2016y \u2212 x/\u03b3\u2016^2 + \u03bb \u03a3_{i<j} max(|y_i|, |y_j|).    (oscarmax)\n\nForward computation. Due to the y \u2208 \u2206d constraint, computing fusedmax/oscarmax does not\nseem trivial at first sight. The next proposition shows how to do so, without any iterative method.\n\nProposition 2 Computing fusedmax and oscarmax (forward computation)\n\nfusedmax: \u03a0\u2126(x) = P\u2206d(PTV(x/\u03b3)), PTV(x) := arg min_{y \u2208 Rd} (1/2)\u2016y \u2212 x\u2016^2 + \u03bb \u03a3_{i=1}^{d\u22121} |y_{i+1} \u2212 y_i|.\n\noscarmax: \u03a0\u2126(x) = P\u2206d(POSC(x/\u03b3)), POSC(x) := arg min_{y \u2208 Rd} (1/2)\u2016y \u2212 x\u2016^2 + \u03bb \u03a3_{i<j} max(|y_i|, |y_j|).\n\nHere, PTV and POSC indicate the proximal operators of 1-d TV and OSCAR, and can be computed\nexactly by [13] and [47], respectively. To remind the reader, P\u2206d denotes the Euclidean projection\nonto the simplex and can be computed exactly using [34, 15]. Proposition 2 shows that we can\ncompute fusedmax and oscarmax using the composition of two functions, for which exact non-iterative\nalgorithms exist. 
This is a surprising result, since the proximal operator of the sum of two\nfunctions is not, in general, the composition of the proximal operators of each function. The proof\nfollows by showing that the indicator function of \u2206d satisfies the conditions of [45, Corollaries 4, 5].\n\nGroups induced by PTV and POSC. Let z\u22c6 be the optimal solution of PTV(x) or POSC(x). For PTV,\nwe denote the group of adjacent elements with the same value as z\u22c6_i by G\u22c6_i, \u2200i \u2208 [d]. Formally,\nG\u22c6_i = [a, b] \u2229 N with a \u2264 i \u2264 b where a and b are the minimal and maximal indices such that\nz\u22c6_i = z\u22c6_j for all j \u2208 G\u22c6_i. For POSC, we define G\u22c6_i as the indices of elements with the same absolute\nvalue as z\u22c6_i, more formally G\u22c6_i = {j \u2208 [d] : |z\u22c6_i| = |z\u22c6_j|}. Because P\u2206d(z\u22c6) = max(z\u22c6 \u2212 \u03b8, 0) for\nsome \u03b8 \u2208 R, fusedmax/oscarmax either shift a group\u2019s common value or set all its elements to zero.\n\n\u03bb controls the trade-off between no fusion (sparsemax) and all elements fused into a single trivial\ngroup. While tuning \u03bb may improve performance, we observe that \u03bb = 0.1 (fusedmax) and \u03bb = 0.01\n(oscarmax) are sensible defaults that work well across all tasks and report all our results using them.\n\nBackward computation. We already know that the Jacobian of P\u2206d is the same as that of sparsemax\nwith \u03b3 = 1. Then, by Proposition 2, if we know how to compute the Jacobians of PTV and POSC, we\ncan obtain the Jacobians of fusedmax and oscarmax by straightforward application of the chain rule.\nHowever, although PTV and POSC can be computed exactly, they lack analytical expressions. 
We next\nshow that we can nonetheless compute their Jacobians efficiently, without needing to solve a system.\n\nProposition 3 Jacobians of PTV and POSC (backward computation)\n\nAssume z\u22c6 = PTV(x) or POSC(x) has been computed. Define the groups derived from z\u22c6 as above.\n\nThen, [JPTV(x)]_{i,j} = 1/|G\u22c6_i| if j \u2208 G\u22c6_i, and 0 otherwise,\n\nand [JPOSC(x)]_{i,j} = sign(z\u22c6_i z\u22c6_j)/|G\u22c6_i| if j \u2208 G\u22c6_i and z\u22c6_i \u2260 0, and 0 otherwise.\n\nThe proof is given in Appendix A.2. Clearly, the structure of these Jacobians permits efficient\nJacobian-vector products; we discuss the computational complexity and implementation details in\nAppendix B. Note that PTV and POSC are differentiable everywhere except at points where groups\nchange. For these points, the same remark as for sparsemax applies, and we can use Clarke\u2019s Jacobian.\n\n4 Experimental results\n\nWe showcase the performance of our attention mechanisms on three challenging natural language\ntasks: textual entailment, machine translation, and sentence summarization. We rely on available,\nwell-established neural architectures, so as to demonstrate simple drop-in replacement of softmax with\nstructured sparse attention; quite likely, newer task-specific models could lead to further improvement.\n\n4.1 Textual entailment (a.k.a. natural language inference) experiments\n\nTextual entailment is the task of deciding, given a text T and a hypothesis H, whether a human\nreading T is likely to infer that H is true [14]. We use the Stanford Natural Language Inference (SNLI)\ndataset [8], a collection of 570,000 English sentence pairs. 
Each pair consists of a sentence and a\nhypothesis, manually labeled with one of the labels ENTAILMENT, CONTRADICTION, or NEUTRAL.\n\nWe use a variant of the neural attention\u2013based classifier proposed for\nthis dataset by [38] and follow the same methodology as [31] in terms\nof implementation, hyperparameters, and grid search. We employ the\nCPU implementation provided in [31] and simply replace sparsemax\nwith fusedmax/oscarmax; we observe that training time per epoch\nis essentially the same for each of the four attention mechanisms\n(timings and more experimental details in Appendix C.2).\n\nTable 1 shows that, for this task, fusedmax reaches the highest accuracy,\nand oscarmax slightly outperforms softmax. Furthermore,\nfusedmax results in the most interpretable feature groupings: Figure 3 shows the weights of the\nneural network\u2019s attention to the text, when considering the hypothesis \u201cNo one is dancing.\u201d In this\ncase, all four models correctly predicted that the text \u201cA band is playing on stage at a concert and the\nattendants are dancing to the music,\u201d denoted along the x-axis, contradicts the hypothesis, although\nthe attention weights differ. Notably, fusedmax identifies the meaningful segment \u201cband is playing\u201d.\n\nTable 1: Textual entailment test accuracy on SNLI [8].\n\nattention    accuracy\nsoftmax      81.66\nsparsemax    82.39\nfusedmax     82.41\noscarmax     81.76\n\nFigure 3: Attention weights when considering the contradicted hypothesis \u201cNo one is dancing.\u201d\n\n4.2 Machine translation experiments\n\nSequence-to-sequence neural machine translation (NMT) has recently become a strong contender in\nmachine translation [2, 29]. In NMT, attention weights can be seen as an alignment between source\nand translated words. To demonstrate the potential of our new attention mechanisms for NMT, we ran\nexperiments on 10 language pairs. 
We build on OpenNMT-py [24], based on PyTorch [37], with all\ndefault hyperparameters (detailed in Appendix C.3), simply replacing softmax with the proposed \u03a0\u2126.\n\nOpenNMT-py with softmax attention is optimized for the GPU. Since sparsemax, fusedmax, and\noscarmax rely on sorting operations, we implement their computations on the CPU for simplicity,\nkeeping the rest of the pipeline on the GPU. However, we observe that, even with this context\nswitching, the number of tokens processed per second was within 3/4 of the softmax pipeline. For\nsq-pnorm-max, we observe that the projected gradient solver used in the forward pass, unlike the\nlinear system solver used in the backward pass, could become a computational bottleneck. To mitigate\nthis effect, we set the tolerance of the solver\u2019s stopping criterion to 10^{\u22122}.\n\nQuantitatively, we find that all compared attention mechanisms are always within 1 BLEU score\npoint of the best mechanism (for detailed results, cf. Appendix C.3). This suggests that structured\nsparsity does not restrict accuracy. However, as illustrated in Figure 4, fusedmax and oscarmax often\nlead to more interpretable attention alignments, as well as to qualitatively different translations.\n\nFigure 4: Attention weights for French to English translation, using the conventions of Figure 1.\nWithin a row, weights grouped by oscarmax under the same cluster are denoted by \u201c\u2022\u201d. Here, oscarmax\nfinds a slightly more natural English translation. More visualizations are given in Appendix C.3.\n\n4.3 Sentence summarization experiments\n\nAttention mechanisms were recently explored for sentence summarization in [39]. To generate\nsentence-summary pairs at low cost, the authors proposed to use the title of a news article as a\nnoisy summary of the article\u2019s leading sentence. 
They collected 4 million such pairs from the\nGigaword dataset and showed that this seemingly simplistic approach leads to models that generalize\nsurprisingly well. We follow their experimental setup and are able to reproduce comparable results to\ntheirs with OpenNMT when using softmax attention. The models we use are the same as in \u00a74.2.\n\n[Figure 3 panels: softmax, sparsemax, fusedmax, oscarmax; attention weight for each word of the text \u201cA band is playing on stage at a concert and the attendants are dancing to the music .\u201d]\n\n[Figure 4 panels, source sentence \u201cLa coalition pour l'aide internationale devrait le lire avec attention .\u201d: fusedmax, \u201cthe coalition for international aid should read it carefully . <EOS>\u201d; oscarmax, \u201cthe international aid coalition should read it carefully . <EOS>\u201d; softmax, \u201cthe coalition for international aid should read it carefully . <EOS>\u201d]\n\nTable 2: Sentence summarization results, following the same experimental setting as in [39].\n\n                DUC 2004                        Gigaword\nattention       ROUGE-1  ROUGE-2  ROUGE-L      ROUGE-1  ROUGE-2  ROUGE-L\nsoftmax         27.16    9.48     24.47        35.13    17.15    32.92\nsparsemax       27.69    9.55     24.96        36.04    17.78    33.64\nfusedmax        28.42    9.96     25.55        36.09    17.62    33.69\noscarmax        27.84    9.46     25.14        35.36    17.23    33.03\nsq-pnorm-max    27.94    9.28     25.08        35.94    17.75    33.66\n\nOur evaluation follows [39]: we use the standard DUC 2004 dataset (500 news articles each paired\nwith 4 different human-generated summaries) and a randomly held-out subset of Gigaword, released\nby [39]. We report results on ROUGE-1, ROUGE-2, and ROUGE-L. Our results, in Table 2, indicate that\nfusedmax is the best under nearly all metrics, always outperforming softmax. 
In addition to Figure 1,\nwe exemplify our enhanced interpretability and provide more detailed results in Appendix C.4.\n\n5 Related work\n\nSmoothed max operators. Replacing the max operator by a differentiable approximation based\non the log sum exp has been exploited in numerous works. Regularizing the max operator with a\nsquared 2-norm is less frequent, but has been used to obtain a smoothed multiclass hinge loss [41] or\nsmoothed linear programming relaxations for maximum a-posteriori inference [33]. Our work differs\nfrom these mainly in two aspects. First, we are less interested in the max operator itself than in its\ngradient, which we use as a mapping from Rd to \u2206d. Second, since we use this mapping in neural\nnetworks trained with backpropagation, we study and compute the mapping\u2019s Jacobian (the Hessian\nof a regularized max operator), in contrast with previous works.\n\nInterpretability, structure and sparsity in neural networks. Providing interpretations alongside\npredictions is important for accountability, error analysis and exploratory analysis, among other\nreasons. Toward this goal, several recent works have been relying on visualizing hidden layer\nactivations [20, 27] and the potential for interpretability provided by attention mechanisms has been\nnoted in multiple works [2, 38, 39]. Our work aims to fulfill this potential by providing a unified\nframework upon which new interpretable attention mechanisms can be designed, using well-studied\ntools from the field of structured sparse regularization.\n\nSelecting contiguous text segments for model interpretations is explored in [26], where an explanation\ngenerator network is proposed for justifying predictions using a fused lasso penalty. However, this\nnetwork is not an attention mechanism and has its own parameters to be learned. Furthermore,\n[26] sidesteps the need to backpropagate through the fused lasso, unlike our work, by using a\nstochastic training approach. 
In contrast, our attention mechanisms are deterministic and drop-in\nreplacements for existing ones. As a consequence, our mechanisms can be coupled with recent\nresearch that builds on top of softmax attention, for example in order to incorporate soft prior\nknowledge about NMT alignment into attention through penalties on the attention weights [12].\n\nA different approach to incorporating structure into attention uses the posterior marginal probabilities\nfrom a conditional random field as attention weights [23]. While this approach takes into account\nstructural correlations, the marginal probabilities are generally dense and different from each other.\nOur proposed mechanisms produce sparse and clustered attention weights, a visible benefit in\ninterpretability. The idea of deriving constrained alternatives to softmax has been independently\nexplored for differentiable easy-first decoding [32]. Finally, sparsity-inducing penalties have been\nused to obtain convex relaxations of neural networks [5] or to compress models [28, 43, 40]. These\nworks differ from ours in that sparsity is enforced on the network parameters, while our approach\ncan produce sparse and structured outputs from neural attention layers.\n\n6 Conclusion and future directions\n\nWe proposed in this paper a unified regularized framework upon which new attention mechanisms can\nbe designed. To enable such mechanisms to be used in a neural network trained by backpropagation,\nwe demonstrated how to carry out forward and backward computations for general differentiable\nregularizers.
We further developed two new structured attention mechanisms, fusedmax and oscarmax,\nand demonstrated that they enhance interpretability while achieving comparable or better accuracy\non three diverse and challenging tasks: textual entailment, machine translation, and summarization.\n\nThe usefulness of a differentiable mapping from real values to the simplex or to [0, 1] with sparse or\nstructured outputs goes beyond attention mechanisms. We expect that our framework will be useful\nto sample from categorical distributions using the Gumbel trick [21, 30], as well as for conditional\ncomputation [6] or differentiable neural computers [18, 19]. We plan to explore these in future work.\n\nAcknowledgements\n\nWe are grateful to Andr\u00e9 Martins, Takuma Otsuka, Fabian Pedregosa, Antoine Rolet, Jun Suzuki, and\nJustine Zhang for helpful discussions. We thank the anonymous reviewers for the valuable feedback.\n\nReferences\n\n[1] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proc. of ICML, 2017.\n\n[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR, 2015.\n\n[3] K. Ball, E. A. Carlen, and E. H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones Mathematicae, 115(1):463\u2013482, 1994.\n\n[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[5] Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In Proc. of NIPS, 2005.\n\n[6] Y. Bengio, N. L\u00e9onard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. In Proc. of NIPS, 2013.\n\n[7] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR.
Biometrics, 64(1):115\u2013123, 2008.\n\n[8] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In Proc. of EMNLP, 2015.\n\n[9] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.\n\n[10] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In Proc. of NIPS, 2015.\n\n[11] F. H. Clarke. Optimization and nonsmooth analysis. SIAM, 1990.\n\n[12] T. Cohn, C. D. V. Hoang, E. Vymolova, K. Yao, C. Dyer, and G. Haffari. Incorporating structural alignment biases into an attentional neural translation model. In Proc. of NAACL-HLT, 2016.\n\n[13] L. Condat. A direct algorithm for 1-d total variation denoising. IEEE Signal Processing Letters, 20(11):1054\u20131057, 2013.\n\n[14] I. Dagan, B. Dolan, B. Magnini, and D. Roth. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering, 15(4):i\u2013xvii, 2009.\n\n[15] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the \u21131-ball for learning in high dimensions. In Proc. of ICML, 2008.\n\n[16] C. Dugas, Y. Bengio, F. B\u00e9lisle, C. Nadeau, and R. Garcia. Incorporating second-order functional knowledge for better option pricing. In Proc. of NIPS, 2001.\n\n[17] J. Friedman, T. Hastie, H. H\u00f6fling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302\u2013332, 2007.\n\n[18] A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. In Proc. of NIPS, 2014.\n\n[19] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwi\u0144ska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471\u2013476, 2016.\n\n[20] O. Irsoy. Deep sequential and structural neural models of compositionality.
PhD thesis, Cornell University, 2017.\n\n[21] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. In Proc. of ICLR, 2017.\n\n[22] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865\u20131890, 2012.\n\n[23] Y. Kim, C. Denton, L. Hoang, and A. M. Rush. Structured attention networks. In Proc. of ICLR, 2017.\n\n[24] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. OpenNMT: Open-source toolkit for neural machine translation. arXiv e-prints, 2017.\n\n[25] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, 2007.\n\n[26] T. Lei, R. Barzilay, and T. Jaakkola. Rationalizing neural predictions. In Proc. of EMNLP, 2016.\n\n[27] J. Li, X. Chen, E. Hovy, and D. Jurafsky. Visualizing and understanding neural models in NLP. In Proc. of NAACL-HLT, 2016.\n\n[28] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proc. of CVPR, 2015.\n\n[29] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP, 2015.\n\n[30] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proc. of ICLR, 2017.\n\n[31] A. F. Martins and R. F. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. of ICML, 2016.\n\n[32] A. F. Martins and J. Kreutzer. Learning what\u2019s easy: Fully differentiable neural easy-first taggers. In Proc. of EMNLP, 2017.\n\n[33] O. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and strong: MAP inference with linear convergence. In Proc. of NIPS, 2015.\n\n[34] C. Michelot.
A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195\u2013200, 1986.\n\n[35] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127\u2013152, 2005.\n\n[36] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127\u2013239, 2014.\n\n[37] PyTorch. http://pytorch.org, 2017.\n\n[38] T. Rockt\u00e4schel, E. Grefenstette, K. M. Hermann, T. Kocisky, and P. Blunsom. Reasoning about entailment with neural attention. In Proc. of ICLR, 2016.\n\n[39] A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In Proc. of EMNLP, 2015.\n\n[40] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81\u201389, 2017.\n\n[41] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1):105\u2013145, 2016.\n\n[42] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91\u2013108, 2005.\n\n[43] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Proc. of NIPS, 2016.\n\n[44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of ICML, 2015.\n\n[45] Y. Yu. On decomposing the proximal map. In Proc. of NIPS, 2013.\n\n[46] C. Zalinescu. Convex analysis in general vector spaces. World Scientific, 2002.\n\n[47] X. Zeng and M. A. Figueiredo. Solving OSCAR regularization problems by fast approximate proximal splitting algorithms.
Digital Signal Processing, 31:124\u2013135, 2014.\n\n[48] X. Zeng and M. A. Figueiredo. The ordered weighted \u21131 norm: Atomic formulation, dual norm, and projections. arXiv e-prints, 2014.\n\n[49] L. W. Zhong and J. T. Kwok. Efficient sparse modeling with automatic feature grouping. IEEE Transactions on Neural Networks and Learning Systems, 23(9):1436\u20131447, 2012.", "award": [], "sourceid": 1888, "authors": [{"given_name": "Vlad", "family_name": "Niculae", "institution": "Cornell University"}, {"given_name": "Mathieu", "family_name": "Blondel", "institution": "NTT"}]}