{"title": "Uncertainty-Aware Attention for Reliable Interpretation and Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 909, "page_last": 918, "abstract": "Attention mechanism is effective in both focusing the deep learning models on relevant features and interpreting them. However, attentions may be unreliable since the networks that generate them are often trained in a weakly-supervised manner. To overcome this limitation, we introduce the notion of input-dependent uncertainty to the attention mechanism, such that it generates attention for each feature with varying degrees of noise based on the given input, to learn larger variance on instances it is uncertain about. We learn this Uncertainty-aware Attention (UA) mechanism using variational inference, and validate it on various risk prediction tasks from electronic health records on which our model significantly outperforms existing attention models. The analysis of the learned attentions shows that our model generates attentions that comply with clinicians' interpretation, and provide richer interpretation via learned variance. Further evaluation of both the accuracy of the uncertainty calibration and the prediction performance with \"I don't know'' decision show that UA yields networks with high reliability as well.", "full_text": "Uncertainty-Aware Attention for\n\nReliable Interpretation and Prediction\n\nJay Heo1,2,4\u2217, Hae Beom Lee1,2\u2217, Saehoon Kim2, Juho Lee2,5, Kwang Joon Kim3,\n\nEunho Yang1,2, Sung Ju Hwang1,2\n\nKAIST1, AItrics2, Yonsei University College of Medicine3, UNIST4, South Korea,\n\n{jayheo, haebeom.lee, sjhwang82, eunhoy}@kaist.ac.kr\n\nUniversity of Oxford5, United Kingdom,\n\nshkim@aitrics.com, preppie@yuhs.ac, juho.lee@stats.ox.ac.uk\n\nAbstract\n\nAttention mechanism is effective in both focusing the deep learning models on\nrelevant features and interpreting them. 
However, attentions may be unreliable since the networks that generate them are often trained in a weakly-supervised manner. To overcome this limitation, we introduce the notion of input-dependent uncertainty to the attention mechanism, such that it generates attention for each feature with varying degrees of noise based on the given input, learning larger variance on instances it is uncertain about. We learn this Uncertainty-aware Attention (UA) mechanism using variational inference, and validate it on various risk prediction tasks from electronic health records, on which our model significantly outperforms existing attention models. The analysis of the learned attentions shows that our model generates attentions that comply with clinicians' interpretation, and provides richer interpretation via the learned variance. Further evaluation of both the accuracy of the uncertainty calibration and the prediction performance with an \"I don't know\" decision shows that UA yields networks with high reliability as well.\n\n1 Introduction\n\nFor many real-world safety-critical tasks, achieving high reliability may be the most important objective when learning predictive models, since incorrect predictions could potentially lead to severe consequences. For instance, failure to correctly predict the sepsis risk of a patient in the ICU may cost his/her life. Deep learning models, while having achieved impressive performance on multitudes of real-world tasks such as visual recognition [17, 10], machine translation [2] and risk prediction for healthcare [3, 4], may still be susceptible to such critical mistakes since most do not have any notion of predictive uncertainty, which often leads to overconfident models [9, 18] that are prone to making mistakes. 
Even worse, they are very difficult to analyze, due to multiple layers of non-linear transformations that involve a large number of parameters.\n\nThe attention mechanism [2] is an effective means of guiding the model to focus on a partial set of the most relevant features for each input instance. It works by generating (often sparse) coefficients for the given features in an input-adaptive manner, allocating more weight to the features that are found to be relevant for the given input. The attention mechanism has been shown to significantly improve model performance on machine translation [2] and image annotation [28] tasks. Another important feature of the attention mechanism is that it allows easy interpretation of the model via the generated attention allocations, and one recent work in the healthcare domain [3] focuses on this aspect.\n\n*Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n(a) Deterministic Attention [3]  (b) Stochastic Attention [28]  (c) Uncertainty-aware Attention (Ours)\nFigure 1: Reliability diagrams [9], which show the accuracy as a function of model confidence, generated from RNNs trained for mortality risk analysis from ICU records (PhysioNet-Mortality). ECE [22] in (8) denotes the Expected Calibration Error, which is the weighted-average gap between model confidence and actual accuracy. (The gap is shown in green bars.) Conventional attention models result in poorly calibrated networks, while our UA yields a well-calibrated one. Such accurately calibrated networks allow us to perform reliable prediction by leveraging the prediction confidence to decide whether to predict or to defer the prediction.\n\nAlthough interpretable, attention mechanisms are still limited as a means of implementing safe deep learning models for safety-critical tasks, as they are not necessarily reliable. 
The attention strengths are commonly generated from a model that is trained in a weakly-supervised manner, and could be incorrectly allocated; thus they may not be safe to base the final prediction on. To build a reliable model that can prevent itself from making critical mistakes, we need a model that knows its own limitations - when it is safe to make predictions and when it is not. However, existing attention models cannot handle this issue, as they do not have any notion of predictive uncertainty. This problem is less of an issue in the conventional uses of attention mechanisms, such as machine translation or image annotation, where we can often find a clear link between the attended parts and the generated output. However, when working with variables that are often noisy and may not be one-to-one matched with the prediction, as in the case of risk prediction with electronic health records, overconfident and inaccurate attentions can lead to incorrect predictions (see Figure 1).\n\nTo tackle this limitation of conventional attention mechanisms, we propose to allow the attention model to output uncertainty on each feature (or input) and to further leverage it when making final predictions. Specifically, we model the attention weights as a Gaussian distribution with input-dependent noise, such that the model generates attentions with small variance when it is confident about the contribution of the given features, and allocates noisy attentions with large variance to uncertain features, for each input. This input-adaptive noise can model heteroscedastic uncertainty [14] that varies based on the instance, which in turn results in uncertainty-based attenuation of the attention strength. 
We formulate this novel uncertainty-aware attention (UA) model under the Bayesian framework and solve it with variational inference.2\n\nWe validate UA on tasks such as sepsis prediction in the ICU and disease risk prediction from electronic health records (EHR), which have a large degree of uncertainty in the input, and on which our model outperforms the baseline attention models by large margins. Further quantitative and qualitative analysis of the learned attentions and their uncertainties shows that our model can also provide richer interpretations that align well with the clinicians' interpretations. For further validation of prediction reliability, we evaluate the uncertainty calibration performance, as well as prediction under a scenario where the model can defer the decision by saying \"I don't know\"; the results show that UA yields significantly better calibrated networks that can better avoid making incorrect predictions on instances it is uncertain about, compared to baseline attention models.\n\nOur contribution in this paper is threefold:\n\n• We propose a novel variational attention model with instance-dependent modeling of variance, which captures input-level uncertainty and uses it to attenuate attention strengths.\n\n• We show that our uncertainty-aware attention yields accurate calibration of model uncertainty, as well as attentions that align well with human interpretations.\n\n• We validate our model on six real-world risk prediction problems in healthcare domains, for both the original binary classification task and classification with an \"I don't know\" decision, and show that our model obtains significant improvements over existing attention models.\n\n2 The source code is publicly available at 
https://github.com/jayheo/UA.\n\n2 Related Work\n\nPrediction reliability There has been work on building reliable deep learning models [29, 13, 14]; that is, deep networks that can avoid making incorrect predictions when they are not sufficiently certain about their predictions. To achieve this goal, a model should know the limitations in the data, and in itself. One way to quantify such limitations is by measuring the predictive uncertainty using Bayesian models. Recently, [7, 5, 6] showed that deep networks with dropout sampling [24] can be understood as Bayesian neural networks. To obtain better calibrated dropout uncertainties, [15, 8] proposed to automatically learn the dropout rates with proper reparameterization tricks [21, 16]. While the aforementioned work mostly focuses on accurate calibration of the uncertainty itself, Kendall and Gal [14] utilized dropout sampling to model predictive uncertainty in computer vision [13, 26], and also modeled label noise with learned variances, to implicitly attenuate the loss for highly uncertain instances. Our work has a similar motivation, but we model the uncertainty in the input data rather than in the labels. By doing so, we can accurately calibrate deep networks for improved reliability. Ayhan et al. [1] have a similar motivation to ours, but with different applications and approaches. There is also a body of work on uncertainty calibration and its quantification. Guo et al. [9] showed that modern deep networks are poorly calibrated despite their accuracies, and proposed to tune factors such as depth, width, and weight decay for better calibration of the model, and Lakshminarayanan et al. 
[18] proposed ensemble and adversarial training for the same objective.\n\nAttention mechanism The literature on the attention mechanism is vast, and includes its application to machine translation [2], memory-augmented networks [25], and image annotation [28]. Attention mechanisms are also used for interpretability, as in Choi et al. [3], which proposed an RNN-based attention generator for EHR that can provide attention on both the hospital visits and the variables, for further analysis by clinicians. Attentions can be either deterministic or probabilistic, and soft (non-sparse) or hard (sparse). Some probabilistic attention models [28] use variational inference, as in our model. However, while their direct learning of a multinoulli distribution only considers whether to attend or not, without consideration of variance, our attention mechanism models a varying degree of uncertainty for each input via input-dependent learning of the attention noise (variance).\n\n3 Approach\n\nWe now describe our uncertainty-aware attention model. Let D be a dataset containing a set of N input data points X = [x^{(1)} . . . x^{(N)}] and the corresponding labels, Y = [y^{(1)} . . . y^{(N)}]. For notational simplicity, we suppress the data index n = 1, . . . , N when it is clear from the context.\n\nWe first present a general framework of a stochastic attention mechanism. Let v(x) ∈ R^{r×i} be the concatenation of i intermediate features, each column v_j(x) of which is a length-r vector, from an arbitrary neural network. From v(x), a set of random variables {a_j}_{j=1}^{i} is conditionally generated from some distribution p(a|x), where the dimension of a_j depends on the model architecture. Then, the context vector c ∈ R^r is computed as c(x) = ∑_{j=1}^{i} a_j ⊙ v_j(x), where the operator ⊙ is properly defined according to the dimensionality of a_j; if a_j is a scalar, it is simply multiplication, while for a_j ∈ R^r it is the element-wise product. 
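As a concrete illustration, the context-vector computation c(x) = ∑_j a_j ⊙ v_j(x) above can be sketched in a few lines (a minimal NumPy sketch; the array shapes and the sampled values are our own illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def context_vector(v, a):
    """c(x) = sum_j a_j * v_j(x): combine the i feature columns of
    v (shape r x i) with attention weights a; a has shape (i,) for
    scalar attention, or (r, i) for element-wise attention."""
    return (v * a).sum(axis=1)  # NumPy broadcasting covers both cases

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))        # r = 4, i = 3 feature columns
a = np.array([0.2, 0.5, 0.3])      # one scalar attention weight per column
c = context_vector(v, a)           # context vector of length r = 4
```

Any downstream predictor then consumes the context vector c.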
The function f here produces the prediction ŷ = f(c(x)) given the context vector c. The attention can be generated either deterministically or stochastically. A stochastic attention mechanism is proposed in [28], where a_j ∈ {0, 1} is generated from a Bernoulli distribution. This variable is learned by maximizing the evidence lower bound (ELBO), with additional regularizations for reducing the variance of the gradients. In [28], the stochastic attention is shown to perform better than its deterministic counterpart on an image annotation task.\n\n3.1 Stochastic attention with input-adaptive Gaussian noise\n\nDespite the performance improvement in [28], there are two limitations in modeling stochastic attention directly with a Bernoulli (or multinoulli) distribution as [28] does, for our purposes:\n\n1) The variance σ² of Bernoulli is completely dependent on the allocation probability μ. Since the variance of a Bernoulli distribution is determined as σ² = μ(1 − μ), the model cannot generate a with low variance if μ is around 0.5, and vice versa. To overcome this limitation, we disentangle the attention strength a from the attention uncertainty, so that the uncertainty can vary even with the same attention strength.\n\n2) The vanilla stochastic attention models the noise independently of the input. This makes it infeasible to model the amount of uncertainty for each input, which is a crucial factor for reliable machine learning. Even for the same prediction task and the same set of features, the amount of uncertainty for each feature may vary largely across different instances.\n\nTo overcome these two limitations, we model the standard deviation σ, which is indicative of the uncertainty, as an input-adaptive function σ(x), enabling the model to reflect the different amounts of confidence it has in each feature, for a given instance. 
As for the distribution, we use a Gaussian, which is arguably the simplest and most efficient choice for our purpose, and is also easy to implement. We first assume that the subset of the neural network parameters ω associated with generating attentions has a zero-mean isotropic Gaussian prior with precision τ. Then the attention scores before squashing, denoted as z, are generated from the conditional distribution pθ(z|x, ω), which is also Gaussian:\n\np(ω) = N(0, τ⁻¹I),   pθ(z|x, ω) = N(μ(x, ω; θ), diag(σ²(x, ω; θ)))   (1)\n\nwhere μ(·, ω; θ) and σ(·, ω; θ) are the mean and standard deviation, parameterized by θ. Note that μ and σ are generated from the same layer, but with different sets of parameters, although we denote those parameters as θ in general. The actual attention a is then obtained by applying some squashing function π(·) to z (e.g. sigmoid or hyperbolic tangent): a = π(z). For comparison, one can think of the vanilla stochastic attention, whose variance is independent of the inputs:\n\np(ω) = N(0, τ⁻¹I),   pθ(z|x, ω) = N(μ(x, ω; θ), diag(σ²))   (2)\n\nHowever, as mentioned, this model cannot express different amounts of uncertainty across features. One important aspect of our model is that, in terms of graphical representation, the distribution p(ω) is independent of x, while the distribution pθ(z|x, ω) is conditional on x. That is, p(ω) tends to capture uncertainty in the model parameters (epistemic uncertainty), while pθ(z|x, ω) reacts sensitively to uncertainty in the data, varying across different input points (heteroscedastic uncertainty) [14]. When the two are modeled together, it has been empirically shown that the quality of uncertainty improves [14]. 
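A minimal sketch of drawing the input-adaptive Gaussian attention of (1) with a pathwise (reparameterized) sample z = μ(x) + σ(x)·ε, ε ~ N(0, I), might look as follows; the single linear layer generating μ and log σ², and all parameter names, are our own illustrative assumptions rather than the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def ua_attention(x, W_mu, b_mu, W_s, b_s):
    """Sample a = pi(z), z ~ N(mu(x), diag(sigma(x)^2)) as in (1),
    via the reparameterization z = mu + sigma * eps."""
    mu = x @ W_mu + b_mu
    log_var = x @ W_s + b_s            # predict log sigma^2 so that sigma > 0
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=mu.shape)    # eps ~ N(0, I)
    z = mu + sigma * eps               # pathwise-differentiable sample of z
    return 1.0 / (1.0 + np.exp(-z))    # squashing pi(z) = sigmoid(z)

d, i = 5, 3                            # input dim and number of features
x = rng.normal(size=(d,))
a = ua_attention(x, rng.normal(size=(d, i)), np.zeros(i),
                 0.1 * rng.normal(size=(d, i)), np.full(i, -2.0))
```

Because sampling is expressed as a deterministic function of (x, ε), gradients can flow through μ and σ during training.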
Modeling both input-agnostic and input-dependent uncertainty is especially important in risk analysis tasks in healthcare, to capture both the uncertainty arising from an insufficient amount of clinical data (e.g. rare diseases), and the uncertainty that varies from patient to patient (e.g. sepsis).\n\n3.2 Variational inference\n\nWe now formalize what we have discussed so far. Let Z be the set of latent variables {z^{(n)}}_{n=1}^{N} that stand for the attention weights before squashing. In neural networks, the posterior distribution p(Z, ω|D) is usually computationally intractable, since p(D) is intractable due to the nonlinear dependencies between variables. Thus, we utilize variational inference, an approximation method that has been shown to be successful in many applications of neural networks [16, 23], along with reparameterization tricks for pathwise backpropagation [15, 8]. Toward this, we first define our variational distribution as\n\nq(Z, ω|D) = qM(ω|X, Y) q(Z|X, Y, ω).   (3)\n\nWe set qM(ω|X, Y) to the dropout approximation [7] with variational parameter M; [7] showed that a neural network with a Gaussian prior on its weight matrices can be approximated with variational inference, in the form of dropout sampling of deterministic weight matrices and ℓ2 weight decay. For the second term, we drop the dependency on Y (since it is not available at test time) and simply set q(Z|X, Y, ω) to be equivalent to pθ(Z|X, ω), which works well in practice [23, 28]. Under the SGVB framework [16], we maximize the evidence lower bound (ELBO):\n\nlog p(Y|X) ≥ E_{ω∼qM(ω|X,Y), Z∼pθ(Z|X,ω)}[log p(Y|X, Z, ω)]   (4)\n   − KL[qM(ω|X, Y) ‖ p(ω)] − KL[q(Z|X, Y, ω) ‖ pθ(Z|X, ω)]   (5)\n\nwhere we approximate the expectation in (4) via Monte-Carlo sampling. The first KL term nicely reduces to ℓ2 regularization on M with the dropout approximation [7]. The second KL term vanishes, as the two distributions are equivalent. Consequently, our final maximization objective is:\n\nL(θ, M; X, Y) = ∑_n log pθ(y^{(n)} | z̃^{(n)}, x^{(n)}) − λ‖M‖²   (6)\n\nwhere we first sample random weights with dropout masks ω̃ ∼ qM(ω|X, Y), and sample z such that z̃ = g(x, ε̃, ω̃), ε̃ ∼ N(0, I), with a pathwise derivative function g for the reparameterization trick. λ is a tunable hyperparameter; in practice, however, it can simply be set to the common ℓ2 decay shared throughout the network, including the other deterministic weights.\n\nWhen testing on a novel input instance x*, we can compute the probability our model assigns to the correct label y*, p(y*|x*), with Monte-Carlo sampling:\n\np(y*|x*) = ∫∫ p(y*|x*, z) p(z|x*, ω) p(ω|X, Y) dω dz ≈ (1/S) ∑_{s=1}^{S} p(y*|x*, z̃^{(s)})   (7)\n\nwhere we first sample dropout masks ω̃^{(s)} ∼ qM(ω|X, Y) and then sample z̃^{(s)} ∼ pθ(z|x*, ω̃^{(s)}).\n\nUncertainty Calibration The quality of the uncertainty from (7) can be evaluated with the reliability diagrams shown in Figure 1. Better calibrated uncertainties produce smaller gaps between model confidences and actual accuracies, shown in green bars. Thus, perfect calibration occurs when the confidences exactly match the actual accuracies: p(correct | confidence = ρ) = ρ, ∀ρ ∈ [0, 1] [9]. Also, [22, 9] proposed a summary statistic for calibration, called the Expected Calibration Error (ECE). It is the expected gap w.r.t. 
the distribution of model confidence (or the frequency of the bins):\n\nECE = E_confidence[ |p(correct | confidence) − confidence| ]   (8)\n\n4 Application to RNNs for Prediction on Time-Series Data\n\nOur variational attention model is generic and can be applied to any deep neural network that leverages an attention mechanism. In this section, however, we describe its application to prediction from time-series data, since our target application is risk analysis from electronic health records.\n\nReview of the RETAIN model As a base deep network for learning from time-series data, we consider RETAIN [3], which is an attentional RNN model with two types of attentions - across timesteps and across features. RETAIN obtains state-of-the-art performance on risk prediction tasks from electronic health records, and is able to provide useful interpretations via the learned attentions. We now briefly review the overall structure of RETAIN, matching the notation with that of the original paper for clear reference. Suppose we are interested in a timestep i. With the input embeddings v_1, . . . , v_i, we generate two different attentions: across timesteps (α) and across features (β).\n\ng_i, ..., g_1 = RNN_α(v_i, ..., v_1; ω),    h_i, ..., h_1 = RNN_β(v_i, ..., v_1; ω),   (9)\ne_j = w_α^T g_j + b_α for j = 1, ..., i,    d_j = W_β h_j + b_β for j = 1, ..., i,   (10)\nα_1, ..., α_i = Softmax(e_1, ..., e_i),    β_j = tanh(d_j) for j = 1, ..., i.   (11)\n\nThe parameters of the two RNNs are collected as ω. From the RNN outputs g and h, the attention logits e and d are generated, followed by the squashing functions Softmax and tanh, respectively. The two generated attentions α and β are then multiplied back into the input embedding v, followed by a convex sum c up to timestep i: c_i = ∑_{j=1}^{i} α_j β_j ⊙ v_j. A final linear predictor is learned on top of it: ŷ_i = Sigmoid(w^T c_i + b).\n\nThe most important feature of RETAIN is that it allows us to interpret what the model has learned, as follows. What we are interested in is the contribution, which shows x_k's aggregate effect on the final prediction at time j. Since RETAIN has attentions on both timesteps (α_j) and features (β_j), the computation of the aggregate contribution takes both into consideration when computing the final contribution of an input data point at a specific timestep: ω(y, x_{j,k}) = α_j w^T(β_j ⊙ W_emb[:, k]) x_{j,k}. In other words, it is the portion of the logit Sigmoid^{-1}(ŷ_i) = w^T c_i + b for which x_{j,k} is responsible.\n\nInterpretation as a probabilistic model The interpretation of RETAIN as a probabilistic model is quite straightforward. First, the RNN parameters ω (9), as Gaussian latent variables (1), are approximated with MC dropout with fixed probabilities [7, 5, 27]. The input-dependent latent variables Z (1) simply correspond to the collection of e and d (10), the attention logits. The log variances of e and d are generated in the same way as their means, from the outputs of the RNNs g and h, but with a different set of parameters. Also, the reparameterization trick for a diagonal Gaussian is simple [16]. We now maximize the ELBO (6), equipped with all the components X, Y, Z, and ω as in the previous section.\n\n                 PhysioNet                                              Cancer        MIMIC\n                 Mortality    Stay < 3     Cardiac      Recovery      Pancreatic    Sepsis\nRETAIN-DA [3]    0.7652±0.02  0.7965±0.01  0.8515±0.02  0.8830±0.01   0.8528±0.01   0.9485±0.01\nRETAIN-SA [28]   0.7635±0.02  0.7695±0.02  0.8412±0.02  0.8582±0.02   0.8444±0.01   0.9360±0.01\nUA-Independent   0.7764±0.01  0.8019±0.01  0.8572±0.02  0.8895±0.01   0.8533±0.03   0.9516±0.01\nUA               0.7827±0.02  0.8017±0.01  0.8628±0.02  0.9049±0.01   0.8604±0.01   0.9563±0.01\nUA+              0.7770±0.02  0.8114±0.01  0.8577±0.01  0.9074±0.01   0.8638±0.02   0.9612±0.01\nTable 1: The binary classification performance on the three electronic health records datasets. The reported numbers are mean AUROC and standard errors for the 95% confidence interval over five random splits.\n\n5 Experiments\n\nTasks and Datasets We validate the performance of our model on various risk prediction tasks from multiple EHR datasets, for both prediction accuracy and prediction reliability.\n\n1) PhysioNet This dataset [11] contains 4,000 medical records from the ICU.3 Each record contains 48 hours of records, with 155 timesteps, each of which contains 36 physiological signals including heart rate, respiration rate, and temperature. The challenge comes with four binary classification tasks, namely, 1) Mortality prediction, 2) Length-of-stay less than 3 days: whether the patient will stay in the ICU for less than three days, 3) Cardiac condition: whether the patient will have a cardiac condition, and 4) Recovery from surgery: whether the patient was recovering from surgery.\n\n2) Pancreatic Cancer This dataset is a subset of the EHR database of the National Health Insurance System (NHIS) in South Korea, consisting of anonymized medical check-up records from 2002 to 2013, which includes around 1.5 million records. We extract 3,699 patient records from this database, among which 1,233 are patients diagnosed with pancreatic cancer. 
The task here is to predict the onset of pancreatic cancer in 2013 using the records from 2002 to 2012 (11 timesteps), which consist of 34 variables covering general information (e.g., sex, height, past medical history, family history) as well as vital information (e.g., systolic pressure, hemoglobin level, creatinine level) and risk-inducing behaviors (e.g., tobacco and alcohol consumption).\n\n3) MIMIC-Sepsis This is a subset of the MIMIC III dataset [12] for sepsis prediction, which consists of 58,000 hospital admissions for 38,646 adults over 12 years. We use a subset that consists of 22,395 records of patients who were over age 15 and stayed in ICUs between 2001 and 2012, among which 2,624 patients were diagnosed with sepsis. We use the data from the first 48 hours after admission (24 timesteps). For the features at each timestep, we select 14 sepsis-related variables including arterial blood pressure, heart rate, FiO2, and Glasgow Coma Score (GCS), following the clinicians' guidelines. We use Sepsis-related Organ Failure Assessment (SOFA) scores to determine the onset of sepsis.\n\nFor all datasets, we generate five random splits of training/validation/test with the ratio 80% : 10% : 10%. 
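The splitting protocol can be sketched as follows (an illustrative sketch; the function name and the fixed seed are our own assumptions):

```python
import numpy as np

def random_splits(n, n_splits=5, ratios=(0.8, 0.1, 0.1), seed=0):
    """Generate (train, validation, test) index triples for n records."""
    rng = np.random.default_rng(seed)
    n_tr, n_va = int(n * ratios[0]), int(n * ratios[1])
    splits = []
    for _ in range(n_splits):
        idx = rng.permutation(n)               # fresh shuffle per split
        splits.append((idx[:n_tr],
                       idx[n_tr:n_tr + n_va],
                       idx[n_tr + n_va:]))
    return splits

splits = random_splits(4000)  # e.g. the 4,000 PhysioNet records
```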
Detailed descriptions of the datasets, network configuration, and hyperparameters are given in the appendix.\n\nBaselines We now describe our uncertainty-calibrated attention models and the relevant baselines.\n1) RETAIN-DA: The recurrent attention model in [3], which uses deterministic soft attention.\n2) RETAIN-SA: The RETAIN model with the stochastic hard attention proposed by [28], which models the attention weights with a multinoulli distribution learned by variational inference.\n3) UA-independent: The input-independent version of our uncertainty-aware attention model in (2), whose variance is modeled independently of the input.\n4) UA: Our input-dependent uncertainty-aware attention model in (1).\n5) UA+: The same as UA, but with additional modeling of input-adaptive noise at the final prediction as done in [14], to account for output uncertainty as well.\n\n5.1 Evaluation of the binary classification performance\n\nWe first examine the prediction accuracy of the baselines and our models in a standard setting where the model always makes a decision. Table 1 contains the accuracy of the baselines and our models measured\n\n3 We only use the TrainingSetA, for which the labels were available.\n\n                     MechVent  DiasABP  HR   Temp   SysABP  FiO2  MAP  Urine  GCS\n35m 5s                  0        81     61   36.2    135     1    71   N/A    15\n38m 10s                 0        75     64   36.7     94     1    74   N/A    15\n38m 55s (current)       1        67     57   35.2    105     1    80    35    10\n\n(a) RETAIN  (b) RETAIN-SA  (c) UA\nFigure 2: Visualization of contributions for a selected patient on the PhysioNet mortality prediction task. MechVent - Mechanical ventilation, DiasABP - Diastolic arterial blood pressure, HR - Heart rate, Temp - Temperature, SysABP - Systolic arterial blood pressure, FiO2 - Fractional inspired oxygen, MAP - Mean arterial blood pressure, Urine - Urine output, GCS - Glasgow coma score. The table presents the value of the physiological variables at the previous and the current timestep. 
Dots correspond to sampled attention weights.\n\nin area under the ROC curve (AUROC). We observe that the UA variants significantly outperform both RETAIN variants, with either deterministic or stochastic attention mechanisms, on all datasets. Note that RETAIN-SA, which generates attention from a Bernoulli distribution, performs the worst. This may be because the model is primarily concerned with whether or not to attend to each feature, which makes sense when most features are irrelevant, such as in machine translation, but not in the case of clinical prediction, where most of the variables are important. UA-independent performs significantly worse than UA or UA+, which demonstrates the importance of input-dependent modeling of the variance. Additional modeling of the output uncertainty with UA+ yields performance gains in most cases.\n\n5.2 Interpretability and accuracy of generated attentions\n\nTo obtain more insight, we further analyze the contribution of each feature in the PhysioNet mortality task in Figure 2 for a patient, at the timestep with the highest attention α, with the help of a physician. The table in Figure 2 shows the values of the variables at the previous checkpoints and the current timestep. The difference between the current and the previous timesteps is significant - the patient was put on mechanical ventilation; the body temperature, diastolic arterial blood pressure, and heart rate dropped; and GCS, which is a measure of consciousness, dropped from 15 to 10. The facts that the patient was put on mechanical ventilation, and that the GCS score is lowered, are both very important markers for assessing the patient's condition. Our model correctly attends to those two variables, with very low uncertainty. SysABP and DiasABP are variables that have cyclic changes in value, and both are within the normal range; however, RETAIN-DA attended to these variables, perhaps because its deterministic model led it to overfit. 
Heart rate is out of the normal range (60-90), which is problematic but not definitive, and thus UA attended to it with high variance. RETAIN-SA results in overly incorrect and noisy attentions, except for FiO2, which did not change its value. The attention on Urine by all models may be an artifact that comes from the missing entries in the previous timesteps. In this case, UA assigned high variance, which shows that it is uncertain about this prediction.\n\nThe previous example shows another advantage of our model: it provides a richer interpretation of why the model has made such predictions, compared to the ones provided by a deterministic or stochastic model without input-dependent modeling of uncertainty. We further compared UA against RETAIN-DA for the accuracy of the attentions, using the variables selected as meaningful by the clinicians as ground-truth labels (avg. 132 variables per record), from the EHRs of a male and a female patient randomly selected from 10 age groups (40s-80s), on PhysioNet-Mortality. We observe that UA generates accurate interpretations that better comply with the clinicians' interpretations (Table 2).\n\n      Specificity  Sensitivity\nDA    75%          68%\nUA    87%          82%\nTable 2: Percentage of features selected by each model that match the features selected by the clinicians.\n\n5.3 Evaluation of prediction reliability\n\nAnother important goal that we aimed to achieve with the modeling of uncertainty in the attention is achieving high reliability in prediction. 
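The ECE of Eq. (8), used as the reliability metric in this section, reduces to a simple binning computation; a minimal sketch (the 10-bin histogram follows common practice and is our own assumption here):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE as in (8): the gap |accuracy - confidence| per confidence bin,
    weighted by the fraction of predictions falling in that bin."""
    conf, correct = np.asarray(conf), np.asarray(correct)
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)      # bin membership by confidence
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += (mask.sum() / len(conf)) * gap
    return ece
```

For a perfectly calibrated model the per-bin gap, and hence the ECE, is zero.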
Prediction reliability is orthogonal to prediction accuracy, and [22] showed that state-of-the-art deep networks are not reliable, as they are not well calibrated: model confidence does not correlate with model accuracy. Thus, to demonstrate the reliability of our uncertainty-aware attention, we evaluate its uncertainty calibration against the baseline attention models in Table 3, using the Expected Calibration Error (ECE) [22] (Eq. (8)).

[Figure 2: three bar-chart panels of per-feature contributions (Vent, DABP, HR, Temp, SABP, FiO2, MAP, Urine, GCS).]

Table 3: Mean Expected Calibration Error (ECE) of various attention models over 5 random splits.

                    PhysioNet                                                      Cancer         MIMIC
                    Mortality     Stay < 3      Cardiac       Recovery     Pancreatic     Sepsis
RETAIN-DA [3]       7.23 ± 0.56   2.04 ± 0.56   5.70 ± 1.56   4.89 ± 0.97   5.45 ± 0.79   3.05 ± 0.56
RETAIN-SA [28]      7.70 ± 0.60   3.77 ± 0.07   8.82 ± 0.64   5.39 ± 0.80   9.69 ± 3.90   5.75 ± 0.29
UA-Independent      5.03 ± 0.94   2.74 ± 1.44   3.55 ± 0.56   4.87 ± 1.46   4.51 ± 0.72   2.04 ± 0.62
UA                  4.22 ± 0.82   1.43 ± 0.53   3.33 ± 0.96   4.46 ± 0.73   3.61 ± 0.55   1.78 ± 0.41
UA+                 4.41 ± 0.52   1.68 ± 0.16   2.66 ± 0.16   3.98 ± 0.59   3.22 ± 0.69   2.04 ± 0.62

Figure 3: Experiments on prediction reliability. The line charts show the ratio of incorrect predictions as a function of the ratio of correct predictions for all datasets: (a) PhysioNet - Mortality, (b) PhysioNet - Stay < 3, (c) PhysioNet - Cardiac, (d) PhysioNet - Recovery, (e) Pancreatic Cancer, (f) MIMIC - Sepsis.
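For concreteness, the two reliability checks used in this section, ECE and the "I don't know" (IDK) rule evaluated below, can be sketched as follows. This is a minimal illustration under our own naming (the hypothetical `sample_fn` stands in for one stochastic forward pass of a model, e.g. with MC-dropout or Gaussian attention noise), not the authors' implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap over bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in bin
            conf = confidences[mask].mean()  # mean confidence in bin
            ece += mask.mean() * abs(acc - conf)
    return ece

def predict_with_idk(sample_fn, x, n_samples=30, std_threshold=0.1):
    """Abstain ("IDK") when the std over stochastic predictions
    exceeds a threshold; otherwise output a hard yes/no decision."""
    preds = np.array([sample_fn(x) for _ in range(n_samples)])
    if preds.std() > std_threshold:
        return "IDK"
    return int(preds.mean() > 0.5)
```

Sweeping `std_threshold` (equivalently, a confidence threshold) traces out the correct-vs-incorrect prediction ratio curves of Figure 3.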
UA and UA+ are significantly better calibrated than RETAIN-DA and RETAIN-SA, as well as UA-Independent, which shows that input-dependent modeling of the variance is essential for obtaining well-calibrated uncertainties.

Prediction with an "I don't know" option  We further evaluate the reliability of our predictive model by allowing it to say I don't know (IDK), that is, to refrain from making a hard yes-or-no decision when it is uncertain about its prediction. This ability to defer decisions is crucial for predictive tasks in clinical environments, since deferred patient records can be given a second-round examination by human clinicians to ensure the safety of the decision. To this end, we measure the uncertainty of each prediction by sampling the variance of the prediction using both MC-dropout and stochastic Gaussian noise over 30 runs, and simply predict IDK for instances whose standard deviation is larger than a set threshold.

Note that we use RETAIN-DA with MC-dropout [5] as our baseline for this experiment, since RETAIN-DA is deterministic and cannot output uncertainty.4 We report the performance of RETAIN-DA, UA, and UA+ on all tasks by plotting the ratio of incorrect predictions as a function of the ratio of correct predictions while varying the threshold on the model confidence (see Figure 3). We observe that both UA and UA+ output a much smaller ratio of incorrect predictions at the same ratio of correct predictions compared to RETAIN-DA, by saying IDK on uncertain inputs. This suggests that our models are relatively more reliable and safer to use for prediction tasks where incorrect predictions can lead to fatal consequences.

6 Conclusion

We proposed the uncertainty-aware attention (UA) mechanism, which can enhance the reliability of both the interpretations and the predictions of general deep neural networks.
Speci\ufb01cally, UA generates attention\nweights following Gaussian distribution with learned mean and variance, that are decoupled and\ntrained in input-adaptive manner. This input-adaptive noise modeling allows to capture heteroscedastic\nuncertainty, or the instance-speci\ufb01c uncertainty, which in turn yields more accurate calibration of\nprediction uncertainty. We trained it using variational inference and validated it on seven different\ntasks from three electronic health records, on which it signi\ufb01cantly outperformed the baselines and\nprovided more accurate and richer interpretations. Further analysis of prediction reliability shows\nthat our model is accurately calibrated and thus can defer predictions when making prediction with \u201cI\ndon\u2019t know\u201d option.\n\n4RETAIN-SA is not compared since it largely underperforms all others and is not a meaningful baseline.\n\n8\n\n0.00.20.40.60.8Ratio of Correct Predictions0.000.020.040.060.080.100.120.14Ratio of Incorrect PredictionsRETAIN-DAUAUA+0.00.20.40.60.81.0Ratio of Correct Predictions0.0000.0050.0100.0150.0200.0250.0300.0350.040Ratio of Incorrect PredictionsRETAIN-DAUAUA+0.00.20.40.60.8Ratio of Correct Predictions0.000.020.040.060.080.10Ratio of Incorrect PredictionsRETAIN-DAUAUA+0.00.20.40.60.8Ratio of Correct Predictions0.000.020.040.060.080.100.120.140.16Ratio of Incorrect PredictionsRETAIN-DAUAUA+0.00.20.40.60.8Ratio of Correct Predictions0.000.050.100.150.20Ratio of Incorrect PredictionsRETAIN-DAUAUA+0.00.20.40.60.8Ratio of Correct Predictions0.000.020.040.060.080.100.120.14Ratio of Incorrect PredictionsRETAIN-DAUAUA+\fAcknowledgments\n\nThis work was supported by a Machine Learning and Statistical Inference Framework for Explainable\nArti\ufb01cial Intelligence (No.2017-0-01779) funded by Institution for Information & Communications\n& Technology Promotion (IITP) and Basic Science Research Program through the National Research\nFoundation of Korea (NRF) funded by the Ministry of 
Education (2015R1D1A1A01061019) of South Korea. Juho Lee is funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.

References

[1] M. S. Ayhan and P. Berens. Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks. In MIDL, 2018.

[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

[3] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In NIPS, 2016.

[4] J. Futoma, S. Hariharan, and K. A. Heller. Learning to detect sepsis with a multitask Gaussian process RNN classifier. In ICML, 2017.

[5] Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. ArXiv e-prints, 2015.

[6] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. ArXiv e-prints, June 2015.

[7] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

[8] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In NIPS, 2017.

[9] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[11] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In CinC, 2012.

[12] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016.

[13] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. ArXiv e-prints, Nov. 2015.

[14] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.

[15] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. ArXiv e-prints, June 2015.

[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[18] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, pages 6405–6416, 2017.

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.

[20] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.

[21] C. J. Maddison, A. Mnih, and Y. Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. ArXiv e-prints, Nov. 2016.

[22] M. P. Naeini, G. F. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, 2015.

[23] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015.

[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[25] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In NIPS, 2015.

[26] R. Tanno, D. E. Worrall, A. Ghosh, E. Kaden, S. N. Sotiropoulos, A. Criminisi, and D. C. Alexander. Bayesian image quality transfer with CNNs: Exploring uncertainty in dMRI super-resolution. ArXiv e-prints, May 2017.

[27] J. van der Westhuizen and J. Lasenby. Bayesian LSTMs in medicine. ArXiv e-prints, June 2017.

[28] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

[29] L. Zhu and N. Laptev. Deep and confident prediction for time series at Uber. ArXiv e-prints, Sept. 2017.