{"title": "Latent Alignment and Variational Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 9712, "page_last": 9724, "abstract": "Neural attention has become central to many state-of-the-art models in natural language processing and related domains. Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A related latent approach, hard attention, fixes these issues, but is generally harder to train and less accurate. This work considers variational attention networks, alternatives to soft and hard attention for learning latent variable alignment models, with tighter approximation bounds based on amortized variational inference. We further propose methods for reducing the variance of gradients to make these approaches computationally feasible. Experiments show that for machine translation and visual question answering, inefficient exact latent variable models outperform standard neural attention, but these gains go away when using hard attention based training. On the other hand, variational attention retains most of the performance gain but with training speed comparable to neural attention.", "full_text": "Latent Alignment and Variational Attention\n\nYuntian Deng\u2217\n\nYoon Kim\u2217\n\nJustin Chiu\n\nDemi Guo\n\nAlexander M. Rush\n\n{dengyuntian@seas,yoonkim@seas,justinchiu@g,dguo@college,srush@seas}.harvard.edu\n\nSchool of Engineering and Applied Sciences\n\nHarvard University\n\nCambridge, MA, USA\n\nAbstract\n\nNeural attention has become central to many state-of-the-art models in natural\nlanguage processing and related domains. 
Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A related latent approach, hard attention, fixes these issues, but is generally harder to train and less accurate. This work considers variational attention networks, alternatives to soft and hard attention for learning latent variable alignment models, with tighter approximation bounds based on amortized variational inference. We further propose methods for reducing the variance of gradients to make these approaches computationally feasible. Experiments show that for machine translation and visual question answering, inefficient exact latent variable models outperform standard neural attention, but these gains go away when using hard attention based training. On the other hand, variational attention retains most of the performance gain but with training speed comparable to neural attention.\n\n1 Introduction\n\nAttention networks [6] have quickly become the foundation for state-of-the-art models in natural language understanding, question answering, speech recognition, image captioning, and more [15, 81, 16, 14, 63, 80, 71, 62]. Alongside components such as residual blocks and long short-term memory networks, soft attention provides a rich neural network building block for controlling gradient flow and encoding inductive biases. However, more so than these other components, which are often treated as black-boxes, researchers use intermediate attention decisions directly as a tool for model interpretability [43, 1] or as a factor in final predictions [25, 68]. From this perspective, attention plays the role of a latent alignment variable [10, 37]. 
An alternative approach, hard attention [80], makes this connection explicit by introducing a latent variable for alignment and then optimizing a bound on the log marginal likelihood using policy gradients. This approach generally performs worse (aside from a few exceptions such as [80]) and is used less frequently than its soft counterpart.\n\nStill, the latent alignment approach remains appealing for several reasons: (a) latent variables facilitate reasoning about dependencies in a probabilistically principled way, e.g. allowing composition with other models; (b) posterior inference provides a better basis for model analysis and partial predictions than strictly feed-forward models, which have been shown to underperform on alignment in machine translation [38]; and finally (c) directly maximizing marginal likelihood may lead to better results.\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Sketch of variational attention applied to machine translation. Two alignment distributions are shown, the blue prior p, and the red variational posterior q taking into account future observations. Our aim is to use q to improve estimates of p and to support improved inference of z.\n\nThe aim of this work is to quantify the issues with attention and propose alternatives based on recent developments in variational inference. While the connection between variational inference and hard attention has been noted in the literature [4, 41], the space of possible bounds and optimization methods has not been fully explored and is growing quickly. These tools allow us to better quantify whether the general underperformance of hard attention models is due to modeling issues (i.e. 
soft attention imbues a better inductive bias) or optimization issues.\n\nOur main contribution is a variational attention approach that can effectively fit latent alignments while remaining tractable to train. We consider two variants of variational attention: categorical and relaxed. The categorical method is fit with amortized variational inference, using a learned inference network and policy gradient with a soft attention variance reduction baseline. With an appropriate inference network (which conditions on the entire source/target), it can be used at training time as a drop-in replacement for hard attention. The relaxed version assumes that the alignment is sampled from a Dirichlet distribution and hence allows attention over multiple source elements.\n\nExperiments describe how to implement this approach for two major attention-based models: neural machine translation and visual question answering (Figure 1 gives an overview of our approach for machine translation). We first show that maximizing exact marginal likelihood can increase performance over soft attention. We further show that with variational (categorical) attention, alignment variables significantly surpass both soft and hard attention results without requiring much more difficult training. We further explore the impact of posterior inference on alignment decisions, and how latent variable models might be employed. Our code is available at https://github.com/harvardnlp/var-attn/.\n\nRelated Work Latent alignment has long been a core problem in NLP, starting with the seminal IBM models [11], HMM-based alignment models [75], and a fast log-linear reparameterization of the IBM 2 model [20]. Neural soft attention models were originally introduced as an alternative approach for neural machine translation [6], and have subsequently been successful on a wide range of tasks (see [15] for a review of applications). 
Recent work has combined neural attention with traditional alignment [18, 72] and induced structure/sparsity [48, 33, 44, 85, 54, 55, 49], which can be combined with the variational approaches outlined in this paper.\n\nIn contrast to soft attention models, hard attention [80, 3] approaches use a single sample at training time instead of a distribution. These models have proven much more difficult to train, and existing works typically treat hard attention as a black-box reinforcement learning problem with log-likelihood as the reward [80, 3, 53, 26, 19]. Two notable exceptions are [4, 41]: both utilize amortized variational inference to learn a sampling distribution which is used to obtain importance-sampled estimates of the log marginal likelihood [12]. Our method uses different estimators and targets the single-sample approach for efficiency, allowing the method to be employed for NMT and VQA applications.\n\nThere has also been significant work in using variational autoencoders for language and translation applications. Of particular interest are those that augment an RNN with latent variables (typically Gaussian) at each time step [17, 22, 66, 23, 40] and those that incorporate latent variables into sequence-to-sequence models [84, 7, 70, 64]. Our work differs by modeling an explicit model component (alignment) as a latent variable instead of auxiliary latent variables (e.g. topics). The term \"variational attention\" has been used to refer to a different approach that treats the output from attention (commonly called the context vector) as a latent variable [7], or to model both the memory and the alignment as latent variables [9]. 
Finally, there is some parallel work [78, 67] which also performs exact/approximate marginalization over latent alignments for sequence-to-sequence learning.\n\n2 Background: Latent Alignment and Neural Attention\n\nWe begin by introducing notation for latent alignment, and then show how it relates to neural attention. For clarity, we are careful to use alignment to refer to this probabilistic model (Section 2.1), and soft and hard attention to refer to two particular inference approaches used in the literature to estimate alignment models (Section 2.2).\n\n2.1 Latent Alignment\n\nFigure 2(a) shows a latent alignment model. Let x be an observed set with associated members {x1, . . . , xi, . . . , xT}. Assume these are vector-valued (i.e. xi \u2208 R^d) and can be stacked to form a matrix X \u2208 R^{d\u00d7T}. Let the observed \u02dcx be an arbitrary \u201cquery\u201d. These generate a discrete output variable y \u2208 Y. This process is mediated through a latent alignment variable z, which indicates which member (or mixture of members) of x generates y. The generative process we consider is:\n\nz \u223c D(a(x, \u02dcx; \u03b8)), y \u223c f(x, z; \u03b8)\n\nwhere a produces the parameters for an alignment distribution D. The function f gives a distribution over the output, e.g. an exponential family. To fit this model to data, we set the model parameters \u03b8 by maximizing the log marginal likelihood of training examples (x, \u02dcx, \u02c6y):2\n\nmax_\u03b8 log p(y = \u02c6y | x, \u02dcx) = max_\u03b8 log E_z[f(x, z; \u03b8)_\u02c6y]\n\nDirectly maximizing this log marginal likelihood in the presence of the latent variable z is often difficult due to the expectation (though tractable in certain cases).\n\nFor this to represent an alignment, we restrict the variable z to be in the simplex \u2206^{T\u22121} over source indices {1, . . . , T}. 
We consider two distributions for this variable: first, let D be a categorical where z is a one-hot vector with z_i = 1 if x_i is selected. For example, f(x, z) could use z to pick from x and apply a softmax layer to predict y, i.e. f(x, z) = softmax(WXz) with W \u2208 R^{|Y|\u00d7d}:\n\nlog p(y = \u02c6y | x, \u02dcx) = log \u2211_{i=1}^{T} p(z_i = 1 | x, \u02dcx) p(y = \u02c6y | x, z_i = 1) = log E_z[softmax(WXz)_\u02c6y]\n\nFigure 2: Models over observed set x, query \u02dcx, and alignment z. (a) Latent alignment model, (b) Soft attention with z absorbed into prediction network.\n\nThis computation requires a factor of O(T) additional runtime, and introduces a major computational factor into already expensive deep learning models.3\n\nSecond, we consider a relaxed alignment where z is a mixture taken from the interior of the simplex by letting D be a Dirichlet. This objective looks similar to the categorical case, i.e. log p(y = \u02c6y | x, \u02dcx) = log E_z[softmax(WXz)_\u02c6y], but the resulting expectation is intractable to compute exactly.\n\n2.2 Attention Models: Soft and Hard\n\nWhen training deep learning models with gradient methods, it can be difficult to use latent alignment directly. As such, two alignment-like approaches are popular: soft attention replaces the probabilistic model with a deterministic soft function, and hard attention trains a latent alignment model by maximizing a lower bound on the log marginal likelihood (obtained from Jensen\u2019s inequality) with policy gradient-style training. We briefly describe how these methods fit into this notation.\n\n2When clear from context, the random variable is dropped from E[\u00b7]. 
We also interchangeably use p(\u02c6y | x, \u02dcx) and f(x, z; \u03b8)_\u02c6y to denote p(y = \u02c6y | x, \u02dcx).\n\n3Although not our main focus, explicit marginalization is sometimes tractable with efficient matrix operations on modern hardware, and we compare the variational approach to explicit enumeration in the experiments. In some cases it is also possible to efficiently perform exact marginalization with dynamic programming if one imposes additional constraints (e.g. monotonicity) on the alignment distribution [83, 82, 58].\n\nSoft Attention Soft attention networks use an altered model shown in Figure 2b. Instead of using a latent variable, they employ a deterministic network to compute an expectation over the alignment variable. We can write this model using the same functions f and a from above,\n\nlog p_soft(y | x, \u02dcx) = log f(x, E_z[z]; \u03b8) = log softmax(WX E_z[z])\n\nA major benefit of soft attention is efficiency. Instead of paying a multiplicative penalty of O(T) or requiring integration, the soft attention model can compute the expectation before f. While formally a different model, soft attention has been described as an approximation of alignment [80]. Since E[z] \u2208 \u2206^{T\u22121}, soft attention uses a convex combination of the input representations XE[z] (the context vector) to obtain a distribution over the output. While also a \u201crelaxed\u201d decision, this expression differs from both the latent alignment models above. Depending on f, the gap between E[f(x, z)] and f(x, E[z]) may be large.\n\nHowever, there are some important special cases. In the case where p(z | x, \u02dcx) is deterministic, we have E[f(x, z)] = f(x, E[z]), and p(y | x, \u02dcx) = p_soft(y | x, \u02dcx). In general we can bound the absolute difference based on the maximum curvature of f, as shown by the following proposition.\n\nProposition 1. 
Define g_{x,\u02c6y} : \u2206^{T\u22121} \u21a6 [0, 1] to be the function given by g_{x,\u02c6y}(z) = f(x, z)_\u02c6y (i.e. g_{x,\u02c6y}(z) = p(y = \u02c6y | x, \u02dcx, z)) for a twice-differentiable function f. Let H_{g_{x,\u02c6y}}(z) be the Hessian of g_{x,\u02c6y}(z) evaluated at z, and further suppose ||H_{g_{x,\u02c6y}}(z)||_2 \u2264 c for all z \u2208 \u2206^{T\u22121}, \u02c6y \u2208 Y, and x, where || \u00b7 ||_2 is the spectral norm. Then for all \u02c6y \u2208 Y,\n\n| p(y = \u02c6y | x, \u02dcx) \u2212 p_soft(y = \u02c6y | x, \u02dcx)| \u2264 c\n\nThe proof is given in Appendix A.4 Empirically the soft approximation works remarkably well, and often moves towards a sharper distribution with training. Alignment distributions learned this way often correlate with human intuition (e.g. word alignment in machine translation) [38].5\n\nHard Attention Hard attention is an approximate inference approach for latent alignment (Figure 2a) [80, 4, 53, 26]. Hard attention takes a single hard sample of z (as opposed to a soft mixture) and then backpropagates through the model. The approach is derived by two choices: first, apply Jensen\u2019s inequality to get a lower bound on the log marginal likelihood, log E_z[p(y | x, z)] \u2265 E_z[log p(y | x, z)]; then maximize this lower bound with policy gradients/REINFORCE [76] to obtain unbiased gradient estimates,\n\n\u2207_\u03b8 E_z[log f(x, z)] = E_z[\u2207_\u03b8 log f(x, z) + (log f(x, z) \u2212 B) \u2207_\u03b8 log p(z | x, \u02dcx)],\n\nwhere B is a baseline that can be used to reduce the variance of this estimator. To implement this approach efficiently, hard attention uses Monte Carlo sampling to estimate the expectation in the gradient computation. 
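To make the estimator concrete, here is a minimal numeric sketch (plain Python with illustrative numbers, not the paper's implementation): `theta` stands in for the alignment logits and `log_f` for the values log f(x, z). It checks that the score-function estimator matches the exact enumerated gradient, and that subtracting a constant baseline B leaves it unbiased.

```python
import math
import random

# Score-function (REINFORCE) estimate of grad_theta E_{z~p}[log f(x, z)]
# for a categorical alignment over T = 3 source positions.
random.seed(0)

theta = [0.2, -0.5, 0.1]       # hypothetical alignment logits
log_f = [-0.3, -1.2, -0.7]     # hypothetical log f(x, z) for each position

def softmax(ws):
    m = max(ws)
    es = [math.exp(w - m) for w in ws]
    s = sum(es)
    return [e / s for e in es]

p = softmax(theta)             # p(z_i = 1 | x, query)

# Exact gradient by enumeration, using grad_{theta_k} log p_i = 1[k=i] - p_k.
exact = [sum(p[i] * log_f[i] * ((k == i) - p[k]) for i in range(3))
         for k in range(3)]

def mc_grad(n_samples, baseline):
    # Average of single-sample terms (log f(z) - B) * grad log p(z).
    grad = [0.0, 0.0, 0.0]
    for _ in range(n_samples):
        i = random.choices(range(3), weights=p)[0]
        w = log_f[i] - baseline
        for k in range(3):
            grad[k] += w * ((k == i) - p[k]) / n_samples
    return grad

est_no_b = mc_grad(200000, baseline=0.0)
est_b = mc_grad(200000, baseline=-0.7)   # constant baseline: still unbiased
```

With the baseline the per-sample terms are smaller in magnitude, so the Monte Carlo estimate has lower variance while its mean is unchanged.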
For efficiency, a single sample from p(z | x, \u02dcx) is used, in conjunction with other tricks to reduce the variance of the gradient estimator (discussed more below) [80, 50, 51].\n\n3 Variational Attention for Latent Alignment Models\n\nAmortized variational inference (AVI, closely related to variational auto-encoders) [36, 61, 50] is a class of methods to efficiently approximate latent variable inference using learned inference networks. In this section we explore this technique for deep latent alignment models, and propose methods for variational attention that combine the benefits of soft and hard attention.\n\nFirst note that the key approximation step in hard attention is to optimize a lower bound derived from Jensen\u2019s inequality. This gap could be quite large, contributing to poor performance.6 Variational\n\n4It is also possible to study the gap in finer detail by considering distributions over the inputs of f that have high probability under approximately linear regions of f, leading to the notion of approximately expectation-linear functions, which was originally proposed and studied in the context of dropout [46].\n\n5Another way of viewing soft attention is as simply a non-probabilistic learned function. While it is possible that such models encode better inductive biases, our experiments show that when properly optimized, latent alignment models with explicit latent variables do outperform soft attention.\n\n6Prior works on hard attention have generally approached the problem as a black-box reinforcement learning problem where the rewards are given by log f(x, z). Ba et al. (2015) [4] and Lawson et al. 
(2017) [41] are the notable exceptions, and both works utilize the framework from [51], which obtains multiple samples from a learned sampling distribution to optimize the IWAE bound [12] or a reweighted wake-sleep objective.\n\nAlgorithm 1 Variational Attention\n\u03bb \u2190 enc(x, \u02dcx, y; \u03c6) \u25b7 Compute var. params\nz \u223c q(z; \u03bb) \u25b7 Sample var. attention\nlog f(x, z) \u25b7 Compute output dist\nz\u2032 \u2190 E_{p(z\u2032 | x,\u02dcx)}[z\u2032] \u25b7 Compute soft atten.\nB = log f(x, z\u2032) \u25b7 Compute baseline dist\nBackprop \u2207_\u03b8 and \u2207_\u03c6 based on eq. 1 and KL\n\nAlgorithm 2 Variational Relaxed Attention\nmax_\u03b8 E_{z\u223cp}[log p(y | x, z)] \u25b7 Pretrain fixed \u03b8\n. . .\nu \u223c U \u25b7 Sample unparam.\nz \u2190 g_\u03c6(u) \u25b7 Reparam sample\nlog f(x, z) \u25b7 Compute output dist\nBackprop \u2207_\u03b8 and \u2207_\u03c6, reparam and KL\n\ninference methods directly aim to tighten this gap. In particular, the evidence lower bound (ELBO) is a parameterized bound over a family of distributions q(z) \u2208 Q (with the constraint that supp q(z) \u2286 supp p(z | x, \u02dcx, y)),\n\nlog E_{z\u223cp(z | x,\u02dcx)}[p(y | x, z)] \u2265 E_{z\u223cq(z)}[log p(y | x, z)] \u2212 KL[q(z) || p(z | x, \u02dcx)]\n\nThis allows us to search over variational distributions q to improve the bound. It is tight when the variational distribution is equal to the posterior, i.e. q(z) = p(z | x, \u02dcx, y). Hard attention is a special case of the ELBO with q(z) = p(z | x, \u02dcx).\n\nThere are many ways to optimize the evidence lower bound; an effective choice for deep learning applications is to use amortized variational inference. AVI uses an inference network to produce the parameters of the variational distribution q(z; \u03bb). The inference network takes in the input, query, and the output, i.e. \u03bb = enc(x, \u02dcx, y; \u03c6). 
The objective aims to reduce the gap with the inference network \u03c6 while also training the generative model \u03b8,\n\nmax_{\u03c6,\u03b8} E_{z\u223cq(z;\u03bb)}[log p(y | x, z)] \u2212 KL[q(z; \u03bb) || p(z | x, \u02dcx)]\n\nWith the right choice of optimization strategy and inference network, this form of variational attention can provide a general method for learning latent alignment models. In the rest of this section, we consider strategies for accurately and efficiently computing this objective; in the next section, we describe instantiations of enc for specific domains.\n\nAlgorithm 1: Categorical Alignments First consider the case where D, the alignment distribution, and Q, the variational family, are categorical distributions. Here the generative assumption is that y is generated from a single index of x. Under this setup, a low-variance estimator of \u2207_\u03b8 ELBO is easily obtained through a single sample from q(z). For \u2207_\u03c6 ELBO, the gradient with respect to the KL portion is easily computable, but there is an optimization issue with the gradient with respect to the first term E_{z\u223cq(z)}[log f(x, z)].\n\nMany recent methods target this issue, including neural estimates of baselines [50, 51], Rao-Blackwellization [59], reparameterizable relaxations [31, 47], and a mix of various techniques [73, 24]. We found that an approach using REINFORCE [76] along with a specialized baseline was effective. 
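The bound being optimized can be checked on a toy categorical alignment. The sketch below (plain Python, illustrative numbers, not the paper's model) computes the ELBO by enumeration and verifies that hard attention's choice q = p gives the plain Jensen bound, while the true posterior makes the bound tight:

```python
import math

# ELBO(q) = E_q[log p(y|x,z)] - KL(q || p(z|x,query)) for a categorical z.
prior = [0.5, 0.3, 0.2]    # hypothetical p(z_i = 1 | x, query)
lik = [0.9, 0.05, 0.2]     # hypothetical p(y | x, z_i = 1)

log_marginal = math.log(sum(pi * li for pi, li in zip(prior, lik)))

def elbo(q):
    recon = sum(qi * math.log(li) for qi, li in zip(q, lik))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
    return recon - kl

# Hard attention corresponds to q = prior (KL term vanishes): Jensen bound.
jensen = elbo(prior)

# The exact posterior q(z) = p(z | x, query, y) makes the bound tight.
norm = sum(pi * li for pi, li in zip(prior, lik))
posterior = [pi * li / norm for pi, li in zip(prior, lik)]
tight = elbo(posterior)
```

Here the Jensen bound is strictly looser than the ELBO at the posterior, which is exactly the gap the inference network is trained to close.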
However, note that REINFORCE is only one of the inference choices we can select, and as we will show later, alternative approaches such as reparameterizable relaxations work as well. Formally, we first apply the likelihood-ratio trick to obtain an expression for the gradient with respect to the inference network parameters \u03c6,\n\n\u2207_\u03c6 E_{z\u223cq(z)}[log p(y | x, z)] = E_{z\u223cq(z)}[(log f(x, z) \u2212 B) \u2207_\u03c6 log q(z)]\n\nAs with hard attention, we take a single Monte Carlo sample (now drawn from the variational distribution). Variance reduction of this estimate falls to the baseline term B. The ideal (and intuitive) baseline would be E_{z\u223cq(z)}[log f(x, z)], analogous to the value function in reinforcement learning. While this term cannot be easily computed, there is a natural, cheap approximation: soft attention (i.e. log f(x, E[z])). Then the gradient is\n\nE_{z\u223cq(z)}[(log (f(x, z) / f(x, E_{z\u2032\u223cp(z\u2032 | x,\u02dcx)}[z\u2032]))) \u2207_\u03c6 log q(z | x, \u02dcx)]  (1)\n\nEffectively this weights gradients to q based on the ratio of the inference network alignment approach to a soft attention baseline. Notably the expectation in the soft attention is over p (and not over q), and therefore the baseline is constant with respect to \u03c6. Note that a similar baseline can also be used for hard attention, and we apply it to both variational/hard attention models in our experiments.\n\nAlgorithm 2: Relaxed Alignments Next consider treating both D and Q as Dirichlets, where z represents a mixture of indices. This model is in some sense closer to the soft attention formulation, which assigns mass to multiple indices, though fundamentally different in that we still formally treat alignment as a latent variable. Again the aim is to find a low-variance gradient estimator. 
Instead of using REINFORCE, certain continuous distributions allow the use of reparameterization [36], where sampling z \u223c q(z) can be done by first sampling from a simple unparameterized distribution U, and then applying a transformation g_\u03c6(\u00b7), yielding an unbiased estimator,\n\nE_{u\u223cU}[\u2207_\u03c6 log p(y | x, g_\u03c6(u))] \u2212 \u2207_\u03c6 KL[q(z) || p(z | x, \u02dcx)]\n\nThe Dirichlet distribution is not directly reparameterizable. While transforming the standard uniform distribution with the inverse CDF of the Dirichlet would result in a Dirichlet distribution, the inverse CDF does not have an analytical solution. However, we can use rejection-based sampling to get a sample, and employ implicit differentiation to estimate the gradient of the CDF [32].\n\nEmpirically, we found that random initialization would result in convergence to uniform Dirichlet parameters for \u03bb. (We suspect that it is easier to find low-KL local optima towards the center of the simplex.) In experiments, we therefore initialize the latent alignment model by first maximizing the Jensen bound, E_{z\u223cp(z | x,\u02dcx)}[log p(y | x, z)], and then introducing the inference network.\n\n4 Models and Methods\n\nWe experiment with variational attention in two different domains where attention-based models are essential and widely used: neural machine translation and visual question answering.\n\nNeural Machine Translation Neural machine translation (NMT) takes in a source sentence and predicts each word of a target sentence yj in an auto-regressive manner. The model first contextually embeds each source word using a bidirectional LSTM to produce the vectors x1 . . . xT. The query \u02dcx consists of an LSTM-based representation of the previous target words y1:j\u22121. Attention is used to identify which source positions should be used to predict the target. 
The parameters of D are generated from an MLP between the query and source [6], and f concatenates the selected x_i with the query \u02dcx and passes it to an MLP to produce the distribution over the next target word y_j.\n\nFor variational attention, the inference network applies a bidirectional LSTM over the source and the target to obtain the hidden states x1, . . . , xT and h1, . . . , hS, and produces the alignment scores at the j-th time step via a bilinear map, s_i^{(j)} = exp(h_j^\u22a4 U x_i). For the categorical case, the scores are normalized, q(z_i^{(j)} = 1) \u221d s_i^{(j)}; in the relaxed case the parameters of the Dirichlet are \u03b1_i^{(j)} = s_i^{(j)}. Note that the inference network sees the entire target (through bidirectional LSTMs). The word embeddings are shared between the generative/inference networks, but other parameters are separate.\n\nVisual Question Answering Visual question answering (VQA) uses attention to locate the parts of an image that are necessary to answer a textual question. We follow the recently-proposed \u201cbottom-up top-down\u201d attention approach [2], which uses Faster R-CNN [60] to obtain object bounding boxes and performs mean-pooling over the convolutional features (from a pretrained ResNet-101 [27]) in each bounding box to obtain object representations x1, . . . , xT. The query \u02dcx is obtained by running an LSTM over the question, and the attention function a passes the query and the object representation through an MLP. The prediction function f is also similar to the NMT case: we concatenate the chosen x_i with the query \u02dcx to use as input to an MLP which produces a distribution over the output. The inference network enc uses the answer embedding h_y and combines it with x_i and \u02dcx to produce the variational (categorical) distribution,\n\nq(z_i = 1) \u221d exp(u^\u22a4 tanh(U1(x_i \u2299 ReLU(V1 h_y)) + U2(\u02dcx \u2299 ReLU(V2 h_y))))\n\nwhere \u2299 is the element-wise product. 
This parameterization worked better than alternatives. We did not experiment with the relaxed case in VQA, as the object bounding boxes already give us the ability to attend to larger portions of the image.\n\nInference Alternatives For categorical alignments we described maximizing a particular variational lower bound with REINFORCE. Note that other alternatives exist, and we briefly discuss them here: 1) instead of the single-sample variational bound we can use a multiple-sample importance sampling based approach such as Reweighted Wake-Sleep (RWS) [4] or VIMCO [52]; 2) instead of REINFORCE we can approximate sampling from the discrete categorical distribution with Gumbel-Softmax [30]; 3) instead of using an inference network we can directly apply Stochastic Variational Inference (SVI) [28] to learn the local variational parameters in the posterior.\n\nPredictive Inference At test time, we need to marginalize out the latent variables, i.e. E_z[p(y | x, \u02dcx, z)] using p(z | x, \u02dcx). In the categorical case, if speed is not an issue then enumerating alignments is preferable, which incurs a multiplicative cost of O(T) (but the enumeration is parallelizable). Alternatively, we experimented with a K-max renormalization, where we only take the top-K attention scores to approximate the attention distribution (by re-normalizing). This makes the multiplicative cost constant with respect to T. For the relaxed case, sampling is necessary.\n\n5 Experiments\n\nSetup For NMT we mainly use the IWSLT dataset [13]. This dataset is relatively small, but has become a standard benchmark for experimental NMT models. We follow the same preprocessing as in [21] with the same Byte Pair Encoding vocabulary of 14k tokens [65]. To show that variational attention scales to large datasets, we also experiment on the WMT 2017 English-German dataset [8], following the preprocessing in [74] except that we use newstest2017 as our test set. 
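The K-max renormalization described under Predictive Inference above can be sketched as follows (plain Python, illustrative numbers, not the paper's code): keep the top-K attention probabilities, renormalize, and use the truncated distribution to approximate the marginal E_z[p(y | x, query, z)].

```python
# Hypothetical attention distribution p(z_i = 1 | x, query) over T = 7 positions
# and hypothetical per-position output probabilities p(y | x, z_i = 1).
attn = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]
out = [0.8, 0.6, 0.1, 0.3, 0.2, 0.9, 0.4]

def kmax_expectation(attn, out, k):
    # Keep the indices of the top-k attention scores, renormalize, and
    # take the expectation under the truncated distribution.
    top = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]
    z = sum(attn[i] for i in top)          # renormalization constant
    return sum(attn[i] / z * out[i] for i in top)

exact = sum(a * o for a, o in zip(attn, out))   # full O(T) enumeration
approx = kmax_expectation(attn, out, 5)          # K = 5, cost independent of T
```

With K equal to T this recovers full enumeration exactly; with K = 5 the approximation error here is under one percent, mirroring the small gap between the Exact and K-Max columns of Table 2.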
For VQA, we use the VQA 2.0 dataset. As we are interested in intrinsic evaluation (i.e. log-likelihood) in addition to the standard VQA metric, we randomly select half of the standard validation set as the test set (since we need access to the actual labels).7 (Therefore the numbers provided are not strictly comparable to existing work.) While the preprocessing is the same as [2], our numbers are worse than previously reported, as we do not apply any of the commonly-utilized techniques to improve performance on VQA such as data augmentation and label smoothing.\n\nExperiments vary three components of the systems: (a) training objective and model, (b) training approximations, comparing enumeration or sampling,8 (c) test inference. All neural models have the same architecture and the exact same number of parameters \u03b8 (the inference network parameters \u03c6 vary, but are not used at test). When training hard and variational attention with sampling, both use the same baseline, i.e. the output from soft attention. The full architectures/hyperparameters for both NMT and VQA are given in Appendix B.\n\nResults and Discussion Table 1 shows the main results. We first note that hard attention underperforms soft attention, even when its expectation is enumerated. This indicates that Jensen\u2019s inequality alone is a poor bound. On the other hand, on both experiments, exact marginal likelihood outperforms soft attention, indicating that when possible it is better to have latent alignments.\n\nFor NMT, on the IWSLT 2014 German-English task, variational attention with enumeration and sampling performs comparably to optimizing the log marginal likelihood, despite the fact that it is optimizing a lower bound. We believe that this is due to the use of q(z), which conditions on the entire source/target and therefore potentially provides better training signal to p(z | x, \u02dcx) through the KL term. 
Note that it is also possible to have q(z) come from a pretrained external model, such as\na traditional alignment model [20]. Table 3 (left) shows these results in context compared to the\nbest reported values for this task. Even with sampling, our system improves on the state-of-the-art.\nOn the larger WMT 2017 English-German task, the superior performance of variational attention\npersists: our baseline soft attention reaches 24.10 BLEU score, while variational attention reaches\n24.98. Note that this only re\ufb02ects a reasonable setting without exhaustive tuning, yet we show that\nwe can train variational attention at scale. For VQA the trend is largely similar, and results for NLL\nwith variational attention improve on soft attention and hard attention. However the task-speci\ufb01c\nevaluation metrics are slightly worse.\nTable 2 (left) considers test inference for variational attention, comparing enumeration to K-max with\nK = 5. For all methods exact enumeration is better, however K-max is a reasonable approximation.\n, 1}. 
7 The VQA eval metric is defined as min{(# humans that said answer) / 3, 1}. Also note that since there are sometimes multiple answers for a given question, in such cases we sample (where the sampling probability is proportional to the number of humans that said the answer) to get a single label.

8 Note that enumeration does not imply exactness if we are enumerating an expectation in a lower bound.

Model                          Objective          E        NMT PPL  NMT BLEU  VQA NLL  VQA Eval
Soft Attention                 log p(y | E[z])    -        7.17     32.77     1.76     58.93
Marginal Likelihood            log E[p]           Enum     6.34     33.29     1.69     60.33
Hard Attention                 E_p[log p]         Enum     7.37     31.40     1.78     57.60
Hard Attention                 E_p[log p]         Sample   7.38     31.00     1.82     56.30
Variational Relaxed Attention  E_q[log p] - KL    Sample   7.58     30.05     -        -
Variational Attention          E_q[log p] - KL    Enum     6.08     33.68     1.69     58.44
Variational Attention          E_q[log p] - KL    Sample   6.17     33.30     1.75     57.52

Table 1: Evaluation on NMT and VQA for the various models. The E column indicates whether the expectation is calculated via enumeration (Enum) or a single sample (Sample) during training. For NMT we evaluate intrinsically on perplexity (PPL, lower is better) and extrinsically on BLEU (higher is better); for BLEU we perform beam search with beam size 10 and a length penalty (see Appendix B for further details). For VQA we evaluate intrinsically on negative log-likelihood (NLL, lower is better) and extrinsically on the VQA evaluation metric (higher is better). All results except for relaxed attention use enumeration at test time.

Model                 PPL (Exact)  PPL (K-Max)  BLEU (Exact)  BLEU (K-Max)
Marginal Likelihood   6.34         6.90         33.29         33.31
Hard + Enum           7.37         7.37         31.40         31.37
Hard + Sample         7.38         7.38         31.00         31.04
Variational + Enum    6.08         6.42         33.68         33.69
Variational + Sample  6.17         6.51         33.30         33.27

Table 2: (Left) Performance change on NMT from exact decoding to K-Max decoding with K = 5.
(see Section 5 for the definition of K-Max decoding). (Right) Test perplexity of different approaches while varying K to estimate E_z[p(y | x, x̃)]. Dotted lines compare the soft baseline and variational attention with full enumeration.

Table 2 (right) shows the PPL of different models as we increase K. Good performance requires K > 1, but we see only marginal benefits for K > 5. Finally, we observe that it is possible to train with soft attention and test using K-Max with a small performance drop (Soft KMax in Table 2 (right)). This possibly indicates that soft attention models are approximating latent alignment models. On the other hand, training with latent alignments and testing with soft attention performed badly.

Table 3 (lower right) looks at the entropy of the prior distribution learned by the different models. Note that hard attention has very low entropy (high certainty), whereas soft attention's is quite high. The variational attention model falls in between. Figure 3 (left) illustrates the difference in practice.

Table 3 (upper right) compares inference alternatives for variational attention. RWS reaches performance comparable to REINFORCE, but at a higher memory cost, as it requires multiple samples. Gumbel-Softmax reaches nearly the same performance and seems like a viable alternative, although we found its performance to be sensitive to the temperature parameter. We also trained a non-amortized SVI model, but found that at similar runtime it was not able to produce satisfactory results, likely due to insufficient updates of the local variational parameters. A hybrid method such as semi-amortized inference [39, 34] might be a potential future direction worth exploring.

Despite extensive experiments, we found that variational relaxed attention performed worse than the other methods.
In particular, we found that when training with a Dirichlet KL it is hard to reach low-entropy regions of the simplex, and the attentions are more uniform than either soft or variational categorical attention. Table 3 (lower right) quantifies this issue. We experimented with other distributions, such as Logistic-Normal and Gumbel-Softmax [31, 47], but neither fixed this issue. Others have also noted difficulty in training Dirichlet models with amortized inference [69].

Besides performance, an advantage of these models is the ability to perform posterior inference, since the q function can be used directly to obtain posterior alignments. Contrast this with hard attention, where q = p(z | x, x̃), i.e. the variational posterior is independent of future information. Figure 3 shows the alignments of p and q for variational attention over a fixed sentence (see Appendix C for more examples). We see that q is able to use future information to correct alignments. The inability of soft and hard attention to produce good alignments has previously been noted as a major issue in NMT [38]. While q is not used directly in left-to-right NMT decoding, it could be employed for other applications, such as in an iterative refinement approach [56, 42].

Figure 3: (Left) An example demonstrating the difference between the prior alignment (red) and the variational posterior (blue) when translating DE-EN (left-to-right). Note the improved blue alignments for "actually" and "violent", which benefit from seeing the next word. (Right) Comparison of soft attention (green) with the p of variational attention (red).
Both models imply a similar alignment, but variational attention has lower entropy.

Model                           IWSLT BLEU
Beam Search Optimization [77]   26.36
Actor-Critic [5]                28.53
Neural PBMT + LM [29]           30.08
Minimum Risk Training [21]      32.84
Soft Attention                  32.77
Marginal Likelihood             33.29
Hard Attention + Enum           31.40
Hard Attention + Sample         30.42
Variational Relaxed Attention   30.05
Variational Attention + Enum    33.69
Variational Attention + Sample  33.30

Inference Method  #Samples  PPL   BLEU
REINFORCE         1         6.17  33.30
RWS               5         6.41  32.96
Gumbel-Softmax    1         6.51  33.08

Model                           Entropy (NMT)  Entropy (VQA)
Soft Attention                  2.70           2.66
Marginal Likelihood             0.82           0.73
Hard Attention + Enum           0.05           0.58
Hard Attention + Sample         0.07           1.24
Variational Relaxed Attention   2.02           -
Variational Attention + Enum    0.54           2.07
Variational Attention + Sample  0.52           2.44

Table 3: (Left) Comparison against the best prior work for NMT on the IWSLT 2014 German-English test set. (Upper Right) Comparison of inference alternatives for variational attention on IWSLT 2014. (Lower Right) Comparison of different models in terms of implied discrete entropy (lower = more certain alignment).

Potential Limitations  While this technique is a promising alternative to soft attention, there are some practical limitations: (a) Variational/hard attention needs a good baseline estimator in the form of soft attention. We found this to be a necessary component for adequately training the system. This may prevent the technique from working when T is intractably large and soft attention is not an option. (b) For some applications, the model relies heavily on having a good posterior estimator. In VQA we had to utilize domain structure to construct the inference network. (c) Recent models, such as the Transformer [74], utilize many repeated attention models.
For instance, the current best translation models have the equivalent of 150 different attention queries per word translated. It is unclear if this approach can be used at that scale, as predictive inference becomes combinatorial.

6 Conclusion

Attention methods are a ubiquitous tool in areas like natural language processing; however, they are difficult to use as latent variable models. This work explores alternative approaches to latent alignment through variational attention, with promising results. Future work will experiment with scaling the method to larger tasks and more complex models, such as multi-hop attention models, transformer models, and structured models, as well as utilizing these latent variables for interpretability and as a way to incorporate prior knowledge.

Acknowledgements

We are grateful to Sam Wiseman and Rachit Singh for insightful comments and discussion, as well as Christian Puhrsch for help with translations. This project was supported by a Facebook Research Award (Low Resource NMT). YK is supported by a Google AI PhD Fellowship. YD is supported by a Bloomberg Research Award. AMR gratefully acknowledges the support of NSF CCF-1704834 and an Amazon AWS Research Award.

References

[1] David Alvarez-Melis and Tommi S. Jaakkola. A Causal Framework for Explaining the Predictions of Black-Box Sequence-to-Sequence Models. In Proceedings of EMNLP, 2017.

[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of CVPR, 2018.

[3] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple Object Recognition with Visual Attention. In Proceedings of ICLR, 2015.

[4] Jimmy Ba, Ruslan R. Salakhutdinov, Roger B. Grosse, and Brendan J. Frey. Learning Wake-Sleep Recurrent Attention Models.
In Proceedings of NIPS, 2015.

[5] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An Actor-Critic Algorithm for Sequence Prediction. In Proceedings of ICLR, 2017.

[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR, 2015.

[7] Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. Variational Attention for Sequence-to-Sequence Models. arXiv:1712.08207, 2017.

[8] Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, and Julia Kreutzer (editors). Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, 2017.

[9] Jorg Bornschein, Andriy Mnih, Daniel Zoran, and Danilo J. Rezende. Variational Memory Addressing in Generative Models. In Proceedings of NIPS, 2017.

[10] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, 1993.

[11] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, 1993.

[12] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. In Proceedings of ICLR, 2015.

[13] Mauro Cettolo, Jan Niehues, Sebastian Stuker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT Evaluation Campaign. In Proceedings of IWSLT, 2014.

[14] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, Attend and Spell.
arXiv:1508.01211, 2015.

[15] Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing Multimedia Content using Attention-based Encoder-Decoder Networks. IEEE Transactions on Multimedia, 2015.

[16] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-Based Models for Speech Recognition. In Proceedings of NIPS, 2015.

[17] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A Recurrent Latent Variable Model for Sequential Data. In Proceedings of NIPS, 2015.

[18] Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. In Proceedings of NAACL, 2016.

[19] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-Markup Generation with Coarse-to-Fine Attention. In Proceedings of ICML, 2017.

[20] Chris Dyer, Victor Chahuneau, and Noah A. Smith. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of NAACL, 2013.

[21] Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. Classical Structured Prediction Losses for Sequence to Sequence Learning. In Proceedings of NAACL, 2018.

[22] Marco Fraccaro, Soren Kaae Sonderby, Ulrich Paquet, and Ole Winther. Sequential Neural Models with Stochastic Layers. In Proceedings of NIPS, 2016.

[23] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Cote, Nan Rosemary Ke, and Yoshua Bengio. Z-Forcing: Training Stochastic Recurrent Networks. In Proceedings of NIPS, 2017.

[24] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation. In Proceedings of ICLR, 2018.

[25] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li.
Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of ACL, 2016.

[26] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes. arXiv:1607.00036, 2016.

[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of CVPR, 2016.

[28] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic Variational Inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

[29] Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. Towards Neural Phrase-based Machine Translation. In Proceedings of ICLR, 2018.

[30] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. arXiv:1611.01144, 2016.

[31] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, 2017.

[32] Martin Jankowiak and Fritz Obermeyer. Pathwise Derivatives Beyond the Reparameterization Trick. In Proceedings of ICML, 2018.

[33] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured Attention Networks. In Proceedings of ICLR, 2017.

[34] Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, and Alexander M. Rush. Semi-Amortized Variational Autoencoders. arXiv:1802.02550, 2018.

[35] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR, 2015.

[36] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, 2014.

[37] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open Source Toolkit for Statistical Machine Translation.
In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics, 2007.

[38] Philipp Koehn and Rebecca Knowles. Six Challenges for Neural Machine Translation. arXiv:1706.03872, 2017.

[39] Rahul G. Krishnan, Dawen Liang, and Matthew Hoffman. On the Challenges of Learning with Inference Networks on Sparse, High-dimensional Data. In Proceedings of AISTATS, 2018.

[40] Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured Inference Networks for Nonlinear State Space Models. In Proceedings of AAAI, 2017.

[41] Dieterich Lawson, Chung-Cheng Chiu, George Tucker, Colin Raffel, Kevin Swersky, and Navdeep Jaitly. Learning Hard Alignments in Variational Inference. In Proceedings of ICASSP, 2018.

[42] Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. arXiv:1802.06901, 2018.

[43] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing Neural Predictions. In Proceedings of EMNLP, 2016.

[44] Yang Liu and Mirella Lapata. Learning Structured Text Representations. In Proceedings of TACL, 2017.

[45] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of EMNLP, 2015.

[46] Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy. Dropout with Expectation-linear Regularization. In Proceedings of ICLR, 2017.

[47] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, 2017.

[48] André F. T. Martins and Ramón Fernandez Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In Proceedings of ICML, 2016.

[49] Arthur Mensch and Mathieu Blondel.
Differentiable Dynamic Programming for Structured Prediction and Attention. In Proceedings of ICML, 2018.

[50] Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. In Proceedings of ICML, 2014.

[51] Andriy Mnih and Danilo J. Rezende. Variational Inference for Monte Carlo Objectives. In Proceedings of ICML, 2016.

[52] Andriy Mnih and Danilo J. Rezende. Variational Inference for Monte Carlo Objectives. arXiv:1602.06725, 2016.

[53] Volodymyr Mnih, Nicola Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent Models of Visual Attention. In Proceedings of NIPS, 2015.

[54] Vlad Niculae and Mathieu Blondel. A Regularized Framework for Sparse and Structured Neural Attention. In Proceedings of NIPS, 2017.

[55] Vlad Niculae, André F. T. Martins, Mathieu Blondel, and Claire Cardie. SparseMAP: Differentiable Sparse Structured Inference. In Proceedings of ICML, 2018.

[56] Roman Novak, Michael Auli, and David Grangier. Iterative Refinement for Machine Translation. arXiv:1610.06602, 2016.

[57] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 2014.

[58] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In Proceedings of ICML, 2017.

[59] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black Box Variational Inference. In Proceedings of AISTATS, 2014.

[60] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of NIPS, 2015.

[61] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
In Proceedings of ICML, 2014.

[62] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. Reasoning about Entailment with Neural Attention. In Proceedings of ICLR, 2016.

[63] Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of EMNLP, 2015.

[64] Philip Schulz, Wilker Aziz, and Trevor Cohn. A Stochastic Decoder for Neural Machine Translation. In Proceedings of ACL, 2018.

[65] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL, 2016.

[66] Iulian Vlad Serban, Alessandro Sordoni, Laurent Charlin, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of AAAI, 2017.

[67] Shiv Shankar, Siddhant Garg, and Sunita Sarawagi. Surprisingly Easy Hard-Attention for Sequence to Sequence Learning. In Proceedings of EMNLP, 2018.

[68] Bonggun Shin, Falgun H. Chokshi, Timothy Lee, and Jinho D. Choi. Classification of Radiology Reports Using Neural Attention Models. In Proceedings of IJCNN, 2017.

[69] Akash Srivastava and Charles Sutton. Autoencoding Variational Inference for Topic Models. In Proceedings of ICLR, 2017.

[70] Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, and Biao Zhang. Variational Recurrent Neural Machine Translation. In Proceedings of AAAI, 2018.

[71] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-To-End Memory Networks. In Proceedings of NIPS, 2015.

[72] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling Coverage for Neural Machine Translation. In Proceedings of ACL, 2016.

[73] George Tucker, Andriy Mnih, Chris J. Maddison, Dieterich Lawson, and Jascha Sohl-Dickstein.
REBAR: Low-Variance, Unbiased Gradient Estimates for Discrete Latent Variable Models. In Proceedings of NIPS, 2017.

[74] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of NIPS, 2017.

[75] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based Word Alignment in Statistical Translation. In Proceedings of COLING, 1996.

[76] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 1992.

[77] Sam Wiseman and Alexander M. Rush. Sequence-to-Sequence Learning as Beam Search Optimization. In Proceedings of EMNLP, 2016.

[78] Shijie Wu, Pamela Shapiro, and Ryan Cotterell. Hard Non-Monotonic Attention for Character-Level Transduction. In Proceedings of EMNLP, 2018.

[79] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144, 2016.

[80] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of ICML, 2015.

[81] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked Attention Networks for Image Question Answering. In Proceedings of CVPR, 2016.

[82] Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomas Kocisky.
The Neural Noisy Channel. In Proceedings of ICLR, 2017.

[83] Lei Yu, Jan Buys, and Phil Blunsom. Online Segment to Segment Neural Transduction. In Proceedings of EMNLP, 2016.

[84] Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. Variational Neural Machine Translation. In Proceedings of EMNLP, 2016.

[85] Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured Attentions for Visual Question Answering. In Proceedings of ICCV, 2017.