{"title": "Single-Model Uncertainties for Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6417, "page_last": 6428, "abstract": "We provide single-model estimates of aleatoric and epistemic uncertainty for deep neural networks.\nTo estimate aleatoric uncertainty, we propose Simultaneous Quantile Regression (SQR), a loss function to learn all the conditional quantiles of a given target variable.\nThese quantiles can be used to compute well-calibrated prediction intervals.\nTo estimate epistemic uncertainty, we propose Orthonormal Certificates (OCs), a collection of diverse non-constant functions that map all training samples to zero.\nThese certificates map out-of-distribution examples to non-zero values, signaling epistemic uncertainty.\nOur uncertainty estimators are computationally attractive, as they do not require ensembling or retraining deep models, and achieve state-of-the-art performance.", "full_text": "Single-Model Uncertainties for Deep Learning\n\nNatasa Tagasovska\n\nDepartment of Information Systems\n\nHEC Lausanne, Switzerland\n\nnatasa.tagasovska@unil.ch\n\nDavid Lopez-Paz\n\nFacebook AI Research\n\nParis, France\ndlp@fb.com\n\nAbstract\n\nWe provide single-model estimates of aleatoric and epistemic uncertainty for deep\nneural networks. To estimate aleatoric uncertainty, we propose Simultaneous\nQuantile Regression (SQR), a loss function to learn all the conditional quantiles\nof a given target variable. These quantiles can be used to compute well-calibrated\nprediction intervals. To estimate epistemic uncertainty, we propose Orthonormal\nCerti\ufb01cates (OCs), a collection of diverse non-constant functions that map all\ntraining samples to zero. These certi\ufb01cates map out-of-distribution examples to\nnon-zero values, signaling epistemic uncertainty. 
Our uncertainty estimators are\ncomputationally attractive, as they do not require ensembling or retraining deep\nmodels, and achieve competitive performance.\n\n1 Introduction\n\nDeep learning permeates our lives, with prospects to drive our cars and decide on our medical\ntreatments. These ambitions will not materialize if deep learning models remain unable to assess\ntheir confidence when performing under diverse situations. Being aware of uncertainty in prediction\nis crucial in multiple scenarios. First, uncertainty plays a central role in deciding when to abstain\nfrom prediction. Abstention is a reasonable strategy to deal with anomalies [11], outliers [39], or\nout-of-distribution examples [69], to detect and defend against adversaries [75], or to delegate high-risk\npredictions to humans [14, 27, 12]. Deep classifiers that \u201cdo not know what they know\u201d may\nconfidently assign one of the training categories to objects that they have never seen. Second,\nuncertainty is the backbone of active learning [68], the problem of deciding which examples humans\nshould annotate to maximally improve the performance of a model. Third, uncertainty estimation is\nimportant when analyzing noise structure, such as in causal discovery [52] and in the estimation of\npredictive intervals. Fourth, uncertainty quantification is one step towards model interpretability [3].\n\nAlthough uncertainty is a wide-reaching concept, most taxonomies consider three sources of it: approximation,\naleatoric, and epistemic uncertainties [20]. First, approximation uncertainty describes the\nerrors made by simplistic models unable to fit complex data (e.g., the error made by a linear model\nfitting a sinusoidal curve). Since the sequel focuses on deep neural networks, which are known to\nbe universal approximators [15], we assume that the approximation uncertainty is negligible and\nomit its analysis. 
Second, aleatoric uncertainty (from the Latin word alea, meaning \u201cdice game\u201d) accounts for the stochasticity of the data. Aleatoric uncertainty describes the variance of the\nconditional distribution of our target variable given our features. This type of uncertainty arises due\nto hidden variables or measurement errors, and cannot be reduced by collecting more data under the\nsame experimental conditions. Third, epistemic uncertainty (from the Greek word episteme, meaning\n\u201cknowledge\u201d) describes the errors associated with the lack of experience of our model at certain regions\nof the feature space. Therefore, epistemic uncertainty is inversely proportional to the density of\ntraining examples, and could be reduced by collecting data in those low-density regions. Figure 1\ndepicts our aleatoric uncertainty (gray shade) and epistemic uncertainty (pink shade) estimates for a\nsimple one-dimensional regression example.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Training data with non-Gaussian noise (blue dots), predicted median (solid line), 65% and\n80% quantiles (dashed lines), aleatoric uncertainty or 95% prediction interval (gray shade, estimated\nby SQR, Sec. 2), and epistemic uncertainty (pink shade, estimated by orthonormal certificates, Sec. 3).\n\nIn general terms, accurate estimation of aleatoric and epistemic uncertainty would allow machine\nlearning models to know better about their limits, acknowledge doubtful predictions, and signal test\ninstances that do not resemble anything seen during their training regime. 
As argued by Begoli et al.\n[5], uncertainty quantification is a problem of paramount importance when deploying machine learning\nmodels in sensitive domains such as information security [72], engineering [82], transportation\n[87], and medicine [5], to name a few.\n\nDespite its importance, uncertainty quantification is a largely unsolved problem. Prior literature on\nuncertainty estimation for deep neural networks is dominated by Bayesian methods [37, 6, 25, 44, 45,\n79], implemented in approximate ways to circumvent their computational intractability. Frequentist\napproaches rely on expensive ensembles of models [8], have been explored only recently [58, 49, 60], and are often\nunable to estimate asymmetric, multimodal, heteroskedastic predictive intervals.\n\nThe ambition of this paper is to provide the community with trustworthy, simple, scalable, single-model\nestimators of aleatoric and epistemic uncertainty in deep learning. To this end, we contribute:\n\n\u2022 Simultaneous Quantile Regression (SQR) to estimate aleatoric uncertainty (Section 2),\n\u2022 Orthonormal Certificates (OCs) to estimate epistemic uncertainty (Section 3),\n\u2022 experiments showing the competitive performance of these estimators (Section 4),\n\u2022 and a unified literature review on uncertainty estimation (Section 5).\n\nWe start our exposition by exploring the estimation of aleatoric uncertainty, that is, the estimation of\nuncertainty related to the conditional distribution of the target variable given the feature variable.\n\n2 Simultaneous Quantile Regression for Aleatoric Uncertainty\n\nLet F (y) = P (Y \u2264 y) be the strictly monotone cumulative distribution function of a target variable Y\ntaking real values y. Consequently, let F \u22121(\u03c4 ) = inf {y : F (y) \u2265 \u03c4 } denote the quantile distribution\nfunction of the same variable Y , for all quantile levels 0 \u2264 \u03c4 \u2264 1. 
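As a small concrete illustration of these two definitions (our own sketch; the sample values and helper names are not from the paper), the empirical analogues of F and F \u22121 can be computed directly:

```python
# Empirical illustration of F(y) = P(Y <= y) and F^{-1}(tau) = inf{y : F(y) >= tau}.
ys = sorted([0.3, -1.2, 0.7, 2.5, 0.0, 1.1, -0.4, 0.9])

def F(y):
    # empirical CDF: fraction of samples that are <= y
    return sum(1 for v in ys if v <= y) / len(ys)

def F_inv(tau):
    # quantile function: smallest observed y with F(y) >= tau
    return min(y for y in ys if F(y) >= tau)

print(F(0.0))      # 3 of the 8 samples are <= 0.0, so 0.375
print(F_inv(0.5))  # smallest y whose empirical CDF reaches 0.5
```

Note how the infimum definition makes F_inv well-defined even though the empirical CDF is a step function.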
The goal of quantile regression is\nto estimate a given quantile level \u03c4 of the target variable Y , when conditioned on the values x taken\nby a feature variable X. That is, we are interested in building a model \u02c6y = \u02c6f\u03c4 (x) approximating the\nconditional quantile distribution function y = F \u22121(\u03c4 |X = x). One strategy to estimate such models\nis to minimize the pinball loss [23, 48, 47, 22]:\n\n\u2113\u03c4 (y, \u02c6y) = \u03c4 (y \u2212 \u02c6y) if y \u2212 \u02c6y \u2265 0, and \u2113\u03c4 (y, \u02c6y) = (1 \u2212 \u03c4 )(\u02c6y \u2212 y) otherwise.\n\nTo see this, write\n\nE [\u2113\u03c4 (y, \u02c6y)] = (\u03c4 \u2212 1) \u222b_{\u2212\u221e}^{\u02c6y} (y \u2212 \u02c6y) dF (y) + \u03c4 \u222b_{\u02c6y}^{\u221e} (y \u2212 \u02c6y) dF (y),\n\nwhere we omit the conditioning on X = x for clarity. Differentiating the previous expression with respect to \u02c6y:\n\n\u2202E [\u2113\u03c4 (y, \u02c6y)] / \u2202\u02c6y = (1 \u2212 \u03c4 ) \u222b_{\u2212\u221e}^{\u02c6y} dF (y) \u2212 \u03c4 \u222b_{\u02c6y}^{\u221e} dF (y) = (1 \u2212 \u03c4 )F (\u02c6y) \u2212 \u03c4 (1 \u2212 F (\u02c6y)) = F (\u02c6y) \u2212 \u03c4.\n\nSetting the previous expression to zero reveals that the loss is minimized at the quantile \u02c6y = F \u22121(\u03c4 ). The absolute loss\ncorresponds to \u03c4 = 1/2, associated with the estimation of the conditional median.\n\nArmed with the pinball loss, we collect a dataset of identically and independently distributed\n(iid) feature-target pairs (x1, y1), . . . , (xn, yn) drawn from some unknown probability distribution\nP (X, Y ). Then, we may estimate the conditional quantile distribution of Y given X at a single\nquantile level \u03c4 as the empirical risk minimizer \u02c6f\u03c4 \u2208 arg min_f (1/n) \u2211_{i=1}^{n} \u2113\u03c4 (f (xi), yi). 
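The derivation above can be checked numerically; the following sketch (our own illustration, with arbitrary sample size and search grid) verifies that the minimizer of the empirical pinball loss sits at the \u03c4-quantile:

```python
import random

def pinball(y, y_hat, tau):
    # pinball loss: tau * (y - y_hat) if y - y_hat >= 0, else (1 - tau) * (y_hat - y)
    return tau * (y - y_hat) if y >= y_hat else (1.0 - tau) * (y_hat - y)

random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(4000)]
tau = 0.9

# grid search over candidate predictions y_hat (step 0.02)
grid = [i / 50.0 for i in range(-150, 151)]
best = min(grid, key=lambda y_hat: sum(pinball(y, y_hat, tau) for y in ys))

empirical_q = sorted(ys)[int(tau * len(ys))]
print(best, empirical_q)  # both should be close to the N(0, 1) 0.9-quantile (about 1.28)
```

The grid minimizer lands on the empirical 0.9-quantile (up to the grid resolution), as the derivative computation predicts.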
Instead, we\npropose to estimate all the quantile levels simultaneously by solving:\n\n\u02c6f \u2208 arg min_f (1/n) \u2211_{i=1}^{n} E_{\u03c4 \u223c U [0,1]} [\u2113\u03c4 (f (xi, \u03c4 ), yi)] . (1)\n\nWe call this model a Simultaneous Quantile Regression (SQR). In practice, we minimize this\nexpression using stochastic gradient descent, sampling fresh random quantile levels \u03c4 \u223c U [0, 1]\nfor each training point and mini-batch during training. The resulting function \u02c6f (x, \u03c4 ) can be used\nto compute any quantile of the conditional variable Y |X = x. Then, our estimate of aleatoric\nuncertainty is the 1 \u2212 \u03b1 prediction interval (where \u03b1 is the significance level) around the median:\n\nua(x\u22c6) := \u02c6f (x\u22c6, 1 \u2212 \u03b1/2) \u2212 \u02c6f (x\u22c6, \u03b1/2). (2)\n\nSQR estimates the entire conditional distribution of the target variable. One SQR model estimates all\nnon-linear quantiles jointly, and does not rely on ensembling models or predictions. This reduces\ntraining time, evaluation time, and storage requirements. The pinball loss \u2113\u03c4 estimates the \u03c4 -th\nquantile consistently [74], providing SQR with strong theoretical grounding. In contrast to prior\nwork in uncertainty estimation for deep learning [25, 49, 60], SQR can model non-Gaussian, skewed,\nasymmetric, multimodal, and heteroskedastic aleatoric noise in data. Figure 2 shows the benefit of\nestimating all the quantiles with a joint model in a noisy example where y = cos(10 \u00b7 x1) + N (0, 1/3)\nand x \u223c N (0, I10). In particular, estimating the quantiles jointly greatly alleviates the undesired\nphenomenon of crossing quantiles [77]. SQR can be employed in any (pre-trained or not) neural\nnetwork without sacrificing performance, as it can be implemented as an additional output layer.\nFinally, Appendix B shows one example of using SQR for binary classification. 
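A minimal sketch of objective (1) in plain Python, on a toy unconditional problem where Y \u223c U[0, 1] and the true quantile function is F \u22121(\u03c4 ) = \u03c4 ; the linear-in-\u03c4 model, learning rates, and step counts are our own illustrative choices, not the paper's architecture:

```python
import random

random.seed(0)
a, b = 0.0, 0.0  # toy SQR model: f(tau) = a + b * tau

def sgd(steps, lr, batch=16):
    # stochastic gradient descent on the randomized pinball objective:
    # fresh tau ~ U[0, 1] is drawn for every sample, as in equation (1)
    global a, b
    for _ in range(steps):
        ga = gb = 0.0
        for _ in range(batch):
            y = random.random()        # target Y ~ U[0, 1]
            tau = random.random()      # fresh random quantile level
            y_hat = a + b * tau
            g = -tau if y >= y_hat else (1.0 - tau)  # d pinball / d y_hat
            ga += g / batch
            gb += g * tau / batch
        a -= lr * ga
        b -= lr * gb

sgd(20000, 0.05)   # rough convergence phase
sgd(10000, 0.005)  # fine-tuning phase with smaller steps

median = a + b * 0.5                    # should approach 0.5
ua = (a + b * 0.975) - (a + b * 0.025)  # 95% interval width of eq. (2), ~0.95
print(median, ua)
```

Because the true quantile function of U[0, 1] is linear in \u03c4 , this two-parameter model can represent all quantiles at once; the recovered median and interval width should land near 0.5 and 0.95, respectively.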
We leave for future\nresearch how to use SQR for multivariate-output problems.\n\nFigure 2: Estimating all the quantiles jointly greatly alleviates the problem of crossing quantiles.\n[Two panels, \u201cseparate estimation\u201d and \u201cjoint estimation\u201d, plot the \u03c4 = 0.1, 0.5, 0.9 quantile estimates and the differences \u03c40.9 \u2212 \u03c40.5 and \u03c40.5 \u2212 \u03c40.1.]\n\n3 Orthonormal Certificates for Epistemic Uncertainty\n\nTo study the problem of epistemic uncertainty estimation, consider a thought experiment in binary\nclassification, discriminating between a positive distribution P and a negative distribution Q. Construct\nthe optimal binary classifier c, mapping samples from P to zero, and mapping samples from Q\nto one. The classifier c is determined by the positions of P and Q. Thus, if we consider a second\nbinary classification problem between the same positive distribution P and a different negative\ndistribution Q\u2032, the new optimal classifier c\u2032 may differ significantly from c. However, both classifiers\nc and c\u2032 have one trait in common: they map samples from the positive distribution P to zero.\n\nThis thought experiment illustrates the difficulty of estimating epistemic uncertainty when learning\nfrom a positive distribution P without any reference to a negative, \u201cout-of-domain\u201d distribution Q.\nThat is, we are interested not only in one binary classifier mapping samples from P to zero, but in\nthe infinite collection of such classifiers. Considering the infinite collection of classifiers mapping\nsamples from the positive distribution P to zero, the upper tail of their class-probabilities should\ndepart significantly from zero at samples not from P , signaling high epistemic uncertainty. 
This\nintuition motivates our epistemic uncertainty estimate, orthonormal certificates.\n\nTo describe the construction of the certificates, consider a deep model y = f (\u03c6(x)) trained on input-target\nsamples drawn from the joint distribution P (X, Y ). Here, \u03c6 is a deep featurizer extracting\nhigh-level representations from data, and f is a shallow classifier grouping such representations\ninto classes. First, construct the dataset of high-level representations of training examples, denoted by\n\u03a6 = {\u03c6(xi)}_{i=1}^{n}. Second, train a collection of certificates C = (C1, . . . , Ck). Each training\ncertificate Cj is a simple neural network trained to map the dataset \u03a6 to zero, by minimizing a loss\nfunction \u2113c. Finally, we define our estimate of epistemic uncertainty as:\n\nue(x\u22c6) := \u2016C\u22a4\u03c6(x\u22c6)\u20162. (3)\n\nDue to the smoothness of \u03c6, the average certificate (3) should evaluate to zero near the training\ndistribution. Conversely, for inputs distinct from those appearing in the training distribution, (3)\nshould depart from zero and signal high epistemic uncertainty. A threshold to declare a new input\n\u201cout-of-distribution\u201d can be obtained as a high percentile of (3) across in-domain data.\n\nWhen using the mean squared error loss for \u2113c, certificates can be seen as a generalization of\ndistance-based estimators of epistemic uncertainty such as\n\nud(x\u22c6) = min_{i=1,...,n} \u2016\u03c6(xi) \u2212 \u03c6(x\u22c6)\u20162;\n\nsince \u2016\u03c6(xi) \u2212 \u03c6(x\u22c6)\u2016\u00b2 = \u2016\u03c6(x\u22c6)\u2016\u00b2 \u2212 (2\u03c6(xi)\u22a4\u03c6(x\u22c6) \u2212 \u2016\u03c6(xi)\u2016\u00b2), the nearest training point is the maximizer over i = 1, . . . , n of 2\u03c6(xi)\u22a4\u03c6(x\u22c6) \u2212 \u2016\u03c6(xi)\u2016\u00b2,\nwhich is a set of n linear certificates C_i\u22a4\u03c6(x\u22c6) = a_i\u22a4\u03c6(x\u22c6) + b_i with coefficients fixed by the feature\nrepresentation of each training example xi. 
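To make the construction tangible, here is a deliberately tiny sketch (the dimensions, synthetic features, penalty weight, and hand-rolled gradients are our own illustrative choices) that trains a linear certificate matrix C to map in-domain features to zero while keeping C\u22a4C close to the identity, and then scores an out-of-distribution input with (3):

```python
import random

random.seed(0)
H, K = 3, 2                     # feature dimension h, number of certificates k
LAM, LR, STEPS = 1.0, 0.05, 4000

# in-domain features: essentially confined to the first axis of R^3
phis = [[random.gauss(0, 1.0), random.gauss(0, 0.01), random.gauss(0, 0.01)]
        for _ in range(200)]

# certificate matrix C (H x K), randomly initialised
C = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(H)]

def score(phi):
    # epistemic score u_e(x) = || C^T phi ||^2
    return sum(sum(C[i][j] * phi[i] for i in range(H)) ** 2 for j in range(K))

for step in range(STEPS):
    phi = phis[step % len(phis)]
    out = [sum(C[i][j] * phi[i] for i in range(H)) for j in range(K)]  # C^T phi
    gram = [[sum(C[i][p] * C[i][q] for i in range(H)) for q in range(K)]
            for p in range(K)]                                         # C^T C
    for i in range(H):
        for j in range(K):
            g = 2.0 * out[j] * phi[i]  # gradient of || C^T phi ||^2
            # gradient of LAM * || C^T C - I ||^2 (diversity/orthonormality penalty)
            g += LAM * 4.0 * sum(C[i][p] * (gram[p][j] - (1.0 if p == j else 0.0))
                                 for p in range(K))
            C[i][j] -= LR * g

in_score = sum(score(p) for p in phis) / len(phis)
ood_score = score([0.0, 1.0, 1.0])  # an input outside the training subspace
print(in_score, ood_score)          # ood_score should be much larger
```

The certificates settle into the low-variance directions of the features, so in-domain scores stay near zero while the out-of-subspace input triggers a large response; without the penalty, the trivial solution C = 0 would score everything as in-domain.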
We implement k certificates on top of an h-dimensional\nrepresentation as a single h \u00d7 k linear layer, trained to predict the k-dimensional zero vector under some loss \u2113c.\nSince we want diverse, non-constant (at zero) certificates, we impose an orthonormality constraint\nbetween certificates. Orthonormality enforces a diverse set of Lipschitz-1 certificates. Then, we\nconstruct our Orthonormal Certificates (OCs) as:\n\n\u02c6C \u2208 arg min_{C \u2208 R^{h \u00d7 k}} (1/n) \u2211_{i=1}^{n} \u2113c(C\u22a4\u03c6(xi), 0) + \u03bb \u00b7 \u2016C\u22a4C \u2212 Ik\u2016. (4)\n\nThe use of non-linear certificates is also possible. In this case, the Lipschitz condition for each\ncertificate can be controlled using a gradient penalty [31], and their diversity can be enforced by\nexplicitly measuring the variance of predictions around zero.\n\nOrthonormal certificates are applicable to both regression and classification problems. The choice of\nthe loss function \u2113c depends on the particular learning problem under study. As a rule of thumb, we\nsuggest training the certificates with the same loss function used in the learning task. When using the\nmean-squared error, linear orthonormal certificates seek the directions in the data with the least amount\nof variance. This leads to an interesting interpretation in terms of Principal Component Analysis\n(PCA). In particular, linear orthonormal certificates trained with the mean squared error estimate\nthe null-space of the training features, the \u201cleast-variant components\u201d: the principal components\nassociated with the smallest singular values of the training features. This view motivates a second-order\nanalysis that provides tail bounds on the behavior of orthonormal certificates.\n\nTheorem 1. 
Let the in-domain data follow x \u223c N (0, \u03a3) with \u03a3 = V \u039bV \u22a4, the out-domain data\nfollow x\u2032 \u223c N (\u00b5\u2032, \u03a3\u2032) with \u03a3\u2032 = V \u2032\u039b\u2032V \u2032\u22a4, and the certificates C \u2208 R^{d \u00d7 k} be the bottom k\neigenvectors of \u03a3, with associated eigenvalues \u0393. Then,\n\nP (\u2016C\u22a4x\u20162 \u2212 E[\u2016C\u22a4x\u20162] \u2265 t) \u2264 e^{\u2212t\u00b2/(2 max_j \u0393_j)},\nP (\u2016C\u22a4x\u2032\u20162 \u2212 E[\u2016C\u22a4x\u2032\u20162] \u2265 t) \u2264 e^{\u2212t\u00b2/(2 max_j \u039b\u2032_j \u2016C\u22a4v\u2032_j\u2016\u00b2)},\n\nwhere v\u2032_j denotes the j-th column of V \u2032.\n\nProof. By the Gaussian affine transform, we have that C\u22a4x \u223c N (0, C\u22a4\u03a3C). To better understand\nthe covariance matrix, note \u03a3C = C\u0393 \u21d2 C\u22a4\u03a3C = C\u22a4C\u0393 = \u0393. The equalities follow from the\neigenvalue definition and the orthonormality of the eigenvectors of real symmetric matrices. Then,\nfor in-domain data, we can write C\u22a4x \u223c N (0, \u0393).\nNext, recall the Gaussian concentration inequality for L-Lipschitz functions f : R^k \u2192 R, applied to\na standard Normal vector Y = (Y1, . . . , Yk), which is P (f (Y ) \u2212 E[f (Y )] \u2265 t) \u2264 e^{\u2212t\u00b2/(2L\u00b2)} for all\nt > 0. In our case, our certificate vector is distributed as \u221a\u0393 Y , where Y is a standard Normal vector.\nThis means that the function f (Y ) = \u2016\u221a\u0393 Y \u20162 has a Lipschitz constant of \u221a(max_j \u0393_j) with j \u2208 1..k,\nleading to the first tail bound.\nA similar line of reasoning follows for out-domain data, where C\u22a4x\u2032 \u223c N (C\u22a4\u00b5\u2032, C\u22a4\u03a3\u2032C). The\nfunction f applied to a standard Normal vector Y is in this case f (Y ) = \u2016C\u22a4V \u2032\u221a\u039b\u2032 Y + C\u22a4\u00b5\u2032\u20162.\nThe squared Lipschitz constant of this function is max_j \u039b\u2032_j \u2016C\u22a4v\u2032_j\u2016\u00b2. 
Applying the Gaussian concentration\ninequality to this function leads to the second tail bound.\n\nThe previous theorem highlights two facts. On the one hand, by estimating the average in-domain\ncertificate response E[\u2016C\u22a4x\u20162] using a validation sample, the first tail bound characterizes the decay\nof epistemic uncertainty. On the other hand, the second tail bound characterizes the set of out-domain\ndistributions that we will be able to discriminate against. In particular, these are out-domain\ndistributions where \u2016C\u22a4V \u2032\u2016 is small, or associated with a small eigenvalue in \u039b\u2032. That is, certificates\nwill be able to distinguish in-domain and out-domain samples as long as these are drawn from two\ndistributions with sufficiently different null-spaces.\n\nCombined uncertainty While the idea of a combined measure of uncertainty is appealing, aleatoric\nand epistemic unknowns measure different quantities in different units; different applications will\nhighlight the importance of one over the other [19, 44]. We leave for future work the study of\ncombining different types of uncertainty.\n\n4 Experiments\n\nWe apply SQR (2) to estimate aleatoric uncertainty in Section 4.1, and OCs (3) to estimate epistemic\nuncertainty in Section 4.2. For applications of SQR to causal discovery, see Appendix A. Our code is\navailable at https://github.com/facebookresearch/SingleModelUncertainty.\n\n4.1 Aleatoric Uncertainty: Prediction Intervals\n\nWe evaluate SQR (2) to construct (1 \u2212 \u03b1) Prediction Intervals (PIs) on eight UCI datasets [4]. These\nare intervals containing the true value of some target variable, given the values of some input\nvariable, with probability at least 1 \u2212 \u03b1. 
The quality of prediction intervals is measured by two\ncompeting objectives:\n\n\u2022 Prediction Interval Coverage Probability (PICP), that is, the fraction of true observations\nfalling inside the estimated prediction interval;\n\u2022 Mean Prediction Interval Width (MPIW), that is, the average width of the prediction intervals.\n\nWe are interested in calibrated prediction intervals (PICP = 1 \u2212 \u03b1) that are narrow (in terms of\nMPIW). For sensitive applications, having well-calibrated predictive intervals is a priority.\n\nWe compare SQR to five popular alternatives. ConditionalGaussian [49] fits a conditional Gaussian\ndistribution and uses its variance to compute prediction intervals. Dropout [25] uses dropout [38]\nat testing time to obtain multiple predictions, using their empirical quantiles as prediction intervals.\nQualityDriven [60] is a state-of-the-art deep model to estimate prediction intervals, that minimizes\na smooth surrogate of the PICP/MPIW metrics. GradientBoosting and QuantileForest [53] are\nensembles of decision trees performing quantile regression.\n\nWe use the same neural network architecture for the first three methods, and cross-validate the\nlearning rate and weight decay parameters for the Adam optimizer [46]. For the tree-based\nmethods, we cross-validate the number of trees and the minimum number of examples to make a\nsplit. We repeat all experiments across 20 random seeds. See Appendix C and code for details.\n\nTable 1 summarizes our experiment. This table shows the test average and standard deviation PICP of\nthose models achieving a validation PICP in [0.925, 0.975]. These are the models with reasonably\n\nTable 1: Evaluation of 95% prediction intervals. We show the test average and standard deviation\nPICP of models achieving a validation PICP in [0.925, 0.975]. In parentheses, we show the test\naverage and standard deviation MPIW associated to those models. 
We show \u201cnone\u201d when the method\ncould not find a model with the desired PICP bounds.\n\ndataset | ConditionalGaussian | Dropout | GradientBoostingQR | QualityDriven | QuantileForest | SQR (ours)\nconcrete | 0.94 \u00b1 0.03 (0.32 \u00b1 0.09) | none | 0.93 \u00b1 0.00 (0.71 \u00b1 0.00) | none | none | 0.94 \u00b1 0.03 (0.31 \u00b1 0.06)\npower | 0.94 \u00b1 0.01 (0.18 \u00b1 0.00) | none | none | 0.93 \u00b1 0.02 (0.34 \u00b1 0.19) | none | 0.93 \u00b1 0.01 (0.18 \u00b1 0.01)\nwine | 0.94 \u00b1 0.02 (0.49 \u00b1 0.03) | 0.94 \u00b1 0.00 (0.37 \u00b1 0.00) | none | none | 0.96 \u00b1 0.01 (0.37 \u00b1 0.02) | 0.93 \u00b1 0.03 (0.45 \u00b1 0.04)\nyacht | 0.93 \u00b1 0.06 (0.03 \u00b1 0.01) | 0.97 \u00b1 0.03 (0.10 \u00b1 0.01) | 0.95 \u00b1 0.02 (0.79 \u00b1 0.01) | 0.92 \u00b1 0.05 (0.04 \u00b1 0.01) | 0.94 \u00b1 0.01 (0.18 \u00b1 0.00) | 0.93 \u00b1 0.06 (0.06 \u00b1 0.04)\nnaval | 0.96 \u00b1 0.01 (0.15 \u00b1 0.25) | 0.96 \u00b1 0.01 (0.23 \u00b1 0.00) | none | 0.94 \u00b1 0.02 (0.21 \u00b1 0.11) | 0.97 \u00b1 0.04 (0.28 \u00b1 0.11) | 0.95 \u00b1 0.02 (0.12 \u00b1 0.09)\nenergy | 0.94 \u00b1 0.03 (0.12 \u00b1 0.18) | 0.91 \u00b1 0.04 (0.17 \u00b1 0.01) | none | 0.91 \u00b1 0.04 (0.10 \u00b1 0.05) | 0.92 \u00b1 0.01 (0.22 \u00b1 0.00) | 0.94 \u00b1 0.03 (0.08 \u00b1 0.03)\nboston | 0.94 \u00b1 0.03 (0.55 \u00b1 0.20) | none | 0.89 \u00b1 0.00 (0.75 \u00b1 0.00) | none | 0.95 \u00b1 0.02 (0.15 \u00b1 0.01) | 0.92 \u00b1 0.06 (0.36 \u00b1 0.09)\nkin8nm | 0.93 \u00b1 0.01 (0.20 \u00b1 0.01) | none | none | 0.96 \u00b1 0.00 (0.84 \u00b1 0.00) | 0.95 \u00b1 0.03 (0.37 \u00b1 0.02) | 0.93 \u00b1 0.01 (0.23 \u00b1 0.02)\n\nwell-calibrated prediction intervals; in particular, PICPs closer to 0.95 are best. Among those models,\nin parentheses, we show their test average and standard deviation MPIW; here, the smallest MPIWs\nare best. 
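For reference, both metrics are straightforward to compute from a batch of intervals; a small sketch with made-up numbers (our own helper, not the paper's evaluation code):

```python
def picp_mpiw(y_true, lower, upper):
    # PICP: fraction of true targets falling inside their prediction interval
    # MPIW: mean width of the prediction intervals
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    picp = inside / len(y_true)
    mpiw = sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
    return picp, mpiw

y_true = [0.1, 0.5, 0.9, 1.4]
lower  = [0.0, 0.3, 0.8, 1.6]
upper  = [0.2, 0.5, 1.0, 1.8]
picp, mpiw = picp_mpiw(y_true, lower, upper)
print(picp, mpiw)  # PICP = 0.75 (the last target misses its interval), MPIW is about 0.2
```

The tension between the two metrics is visible even here: widening the last interval to cover 1.4 would raise PICP to 1.0 at the cost of a larger MPIW.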
We show \u201cnone\u201d for methods that could not find a single model with the desired PICP\nbounds in a given dataset. Overall, our method SQR is able to find the narrowest well-calibrated\nprediction intervals. The only other method able to find well-calibrated prediction intervals for all\ndatasets is the simple but strong baseline ConditionalGaussian. However, our SQR model estimates a\nnon-parametric conditional distribution for the target variable, which can be helpful in subsequent\ntasks. We obtained similar results for 90% and 99% prediction intervals.\n\n4.2 Epistemic Uncertainty: Out-of-Distribution Detection\n\nWe evaluate the ability of Orthonormal Certificates, OCs (3), to estimate epistemic uncertainty in the\ntask of out-of-distribution example detection. We consider four classification datasets with ten classes:\nMNIST, CIFAR-10, Fashion-MNIST, and SVHN. We split each of these datasets at random into\nfive \u201cin-domain\u201d classes and five \u201cout-of-domain\u201d classes. Then, we train PreActResNet18 [34, 51]\nand VGG [71] models on the training split of the in-domain classes. Finally, we use a measure of\nepistemic uncertainty on top of the last-layer features to distinguish between the testing split of the\nin-domain classes and the testing split of the out-of-domain classes. These two splits are roughly the\nsame size in all datasets, so we use the ROC AUC to measure the performance of different epistemic\nuncertainty estimates at the task of distinguishing in- versus out-of-distribution test instances.\n\nWe note that our experimental setup is much more challenging than the ones usually considered in the\none-class classification literature (where one is interested in a single in-domain class) and in the\nout-of-distribution literature (where the in- and out-of-domain classes often belong to different datasets).\n\nWe compare orthonormal certificates (3) to a variety of well-known uncertainty estimates. 
These\ninclude Bayesian linear regression (covariance), distance to the nearest training points (distance), largest\nsoftmax score [35, largest], absolute difference between the two largest softmax scores (functional),\nsoftmax entropy (entropy), geometrical margin [80, geometrical], ODIN [50], random network distillation\n[9, distillation], principal component analysis (PCA), Deep Support Vector Data Description\n[65, SVDD], BALD [40], a random baseline (random), and an oracle trained to separate the in- and\nout-of-domain examples. Among all these methods, BALD is a Dropout-based ensemble method that\nrequires training the entire neural network from scratch [26].\n\nTable 2: ROC AUC means and standard deviations of out-of-distribution detection experiments. All\nmethods except BALD are single-model and work on top of the last layer of a previously trained\nnetwork. BALD (\u2020) requires ensembling predictions and training the neural network from scratch.\n\nmethod | cifar | fashion | mnist | svhn\ncovariance | 0.64 \u00b1 0.00 | 0.71 \u00b1 0.13 | 0.81 \u00b1 0.00 | 0.56 \u00b1 0.00\ndistance | 0.60 \u00b1 0.11 | 0.73 \u00b1 0.10 | 0.74 \u00b1 0.10 | 0.64 \u00b1 0.13\ndistillation | 0.53 \u00b1 0.01 | 0.62 \u00b1 0.03 | 0.71 \u00b1 0.05 | 0.56 \u00b1 0.03\nentropy | 0.80 \u00b1 0.01 | 0.86 \u00b1 0.01 | 0.91 \u00b1 0.01 | 0.93 \u00b1 0.01\nfunctional | 0.79 \u00b1 0.00 | 0.87 \u00b1 0.02 | 0.92 \u00b1 0.01 | 0.92 \u00b1 0.00\ngeometrical | 0.70 \u00b1 0.11 | 0.66 \u00b1 0.07 | 0.75 \u00b1 0.10 | 0.77 \u00b1 0.13\nlargest | 0.78 \u00b1 0.02 | 0.85 \u00b1 0.02 | 0.89 \u00b1 0.01 | 0.93 \u00b1 0.01\nODIN | 0.74 \u00b1 0.09 | 0.84 \u00b1 0.00 | 0.89 \u00b1 0.00 | 0.88 \u00b1 0.08\nPCA | 0.60 \u00b1 0.09 | 0.57 \u00b1 0.07 | 0.64 \u00b1 0.06 | 0.55 \u00b1 0.03\nrandom | 0.50 \u00b1 0.00 | 0.51 \u00b1 0.00 | 0.51 \u00b1 0.00 | 0.50 \u00b1 0.00\nSVDD | 0.52 \u00b1 0.01 | 0.54 \u00b1 0.03 | 0.59 \u00b1 0.03 | 0.51 \u00b1 0.01\nBALD\u2020 | 0.80 \u00b1 0.04 | 0.95 \u00b1 0.02 | 0.95 \u00b1 0.02 | 0.90 \u00b1 0.01\nOCs (ours) | 0.83 \u00b1 0.01 | 0.92 \u00b1 0.00 | 0.95 \u00b1 0.00 | 0.91 \u00b1 0.00\nunregularized OCs | 0.78 \u00b1 0.00 | 0.87 \u00b1 0.00 | 0.91 \u00b1 0.00 | 0.88 \u00b1 0.00\noracle | 0.94 \u00b1 0.00 | 1.00 \u00b1 0.00 | 1.00 \u00b1 0.00 | 0.99 \u00b1 0.00\n\nTable 2 shows the ROC AUC means and standard deviations (across hyper-parameter configurations)\nfor all methods and datasets. In these experiments, the variance across hyper-parameter configurations\nis especially important, since we lack out-of-distribution examples during training and validation.\nSimple methods such as functional, largest, and entropy show non-trivial performance at the task of\ndetecting out-of-distribution examples. This is in line with the results of [35]. OCs achieve the overall\nbest average accuracy (90%, followed by the entropy method at 87%), with little or no variance\nacross hyper-parameters (the results are stable for any regularization strength \u03bb > 1). OCs are also\nthe best method at rejecting samples across datasets, able to use an MNIST network to reject 96% of\nFashion-MNIST examples (followed by ODIN at 85%).\n\n5 Related Work\n\nAleatoric uncertainty Capturing aleatoric uncertainty amounts to learning about the conditional distribution\nof a target variable given values of an input variable. One classical strategy to achieve this goal is to\nassume that such a conditional distribution is Gaussian at all input locations. Then, one can dedicate\none output of the neural network to estimate the conditional variance of the target via maximum\nlikelihood estimation [57, 44, 49]. While simple, this strategy is restricted to modeling Gaussian aleatoric\nuncertainties, which are symmetric and unimodal. These methods can be understood as the neural\nnetwork analogues of the aleatoric uncertainty estimates provided by Gaussian processes [63].\n\nA second strategy, implemented by [60], is to use quality metrics for predictive intervals (such as\nPICP/MPIW) as a learning objective. 
This strategy leads to well-calibrated prediction intervals. Other\nBayesian methods [33] predict other scalar uncertainty statistics (such as conditional entropy) to\nmodel aleatoric uncertainty. However, these estimates summarize conditional distributions into scalar\nvalues, and are thus unable to distinguish between unimodal and multimodal uncertainty profiles.\n\nIn order to capture complex (multimodal, asymmetric) aleatoric uncertainties, a third strategy is to use\nimplicit generative models [55]. These are predictors that accept a noise vector as an additional input,\nin order to provide multiple predictions at any given location. They are trained to minimize the divergence\nbetween the conditional distribution of their multiple predictions and that of the available data,\nbased on samples. The multiple predictions can later be used as an empirical distribution of the\naleatoric uncertainty. Some of these models are conditional generative adversarial networks [54] and\nDiscoNets [7]. However, these models are difficult to train, and suffer from problems such as \u201cmode\ncollapse\u201d [29], which would lead to wrong prediction intervals.\n\nA fourth popular family of non-linear quantile regression techniques is based on decision trees.\nHowever, these techniques estimate a separate model per quantile level [77, 86], or require an a-priori finite\ndiscretization of the quantile levels [81, 64, 10, 85]. The one method in this category able to estimate all\nnon-linear quantiles jointly is Quantile Random Forests [53], which SQR outperforms.\n\nAlso related to SQR, there are a few examples of using the pinball loss to train neural networks\nfor quantile regression. 
These works considered the estimation of individual quantile levels [83, 78], or\nunconditional quantile estimates with no applications to uncertainty estimation [16, 59].\n\nEpistemic uncertainty Capturing epistemic uncertainty amounts to learning which regions of the input\nspace are unexplored by the training data. As we review in this section, most estimates of epistemic\nuncertainty are based on measuring the discrepancy between different predictors trained on the same\ndata. These include the seminal works on bootstrapping [21] and bagging [8] from statistics. Recent\nneural network methods to estimate epistemic uncertainty follow this principle [49].\n\nAlthough the previous references follow a frequentist approach, the strategy of ensembling models is\na natural fit for Bayesian methods, since these can measure the discrepancy between the (possibly\ninfinitely many) hypotheses contained in a posterior distribution. Since exact Bayesian\ninference is intractable for deep neural networks, recent years have witnessed a big effort in developing\napproximate alternatives. First, some works [6, 37] place an independent Gaussian prior on each\nweight of a neural network, and then learn the means and variances of these Gaussians using\nbackpropagation. After training, the weight variances can be used to sample diverse networks, used to\nobtain diverse predictions and the corresponding estimate of epistemic uncertainty. A second line\nof work [25, 24, 44] employs dropout [38] during the training and evaluation of a neural network as\nan alternative way to obtain an ensemble of predictions. Since dropout has been largely replaced\nby batch normalization [42, 34], Teye et al. [79] showed how to use batch normalization to\nobtain ensembles of predictions from a single neural network.\n\nA second strategy for estimating epistemic uncertainty is to use the data as a starting point to construct \u201cnegative\nexamples\u201d. 
These negative examples resemble realistic input configurations that would lie outside the data distribution. Then, a predictor trained to distinguish between original training points and negative examples can be used to measure epistemic uncertainty. Examples of this strategy include noise-contrastive estimation [32], noise-contrastive priors [33], and GANs [30].

In the machine learning literature, the estimation of epistemic uncertainty is often motivated in terms of detecting out-of-distribution examples [69]. However, the often-ignored literature on anomaly/outlier detection and one-class classification can also be seen as an effort to estimate epistemic uncertainty [61]. Even though one-class classification methods are implemented mostly in terms of kernel methods [66, 67], there are recent extensions that leverage deep neural networks [65].

Most related to our orthonormal certificates, deep Support Vector Data Description [65, SVDD] also trains a function to map all in-domain examples to a constant value. However, its evaluation is restricted to the task of one-class classification, and our experiments show that its performance is drastically reduced when the in-domain data is more diverse (contains more classes). Also, our orthonormal certificates do not require learning a deep model end-to-end, but can be applied to the last-layer representation of any pre-trained network. Finally, we found that our proposed diversity regularizer (4) was crucial to obtain a diverse, well-performing set of certificates.

6 Conclusion

Motivated by the importance of quantifying confidence in the predictions of deep models, we proposed simple yet effective tools to measure aleatoric and epistemic uncertainty in deep learning models. To capture aleatoric uncertainty, we proposed Simultaneous Quantile Regression (SQR), implemented in terms of minimizing a randomized version of the pinball loss function.
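As a concrete illustration (a minimal sketch, not the exact implementation used in our experiments), the pinball loss at a quantile level τ, and a randomized version that samples τ uniformly per example, can be written in NumPy as:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss: rho_tau(u) = max(tau * u, (tau - 1) * u),
    where u = y_true - y_pred. It is minimized in expectation by the
    tau-th conditional quantile of the target variable."""
    u = y_true - y_pred
    return np.maximum(tau * u, (tau - 1.0) * u)

def randomized_pinball_loss(y_true, y_pred, rng):
    """Illustrative randomized version: draw one quantile level per example
    and average the pinball loss, so that a single model conditioned on tau
    can be trained to output all conditional quantiles simultaneously."""
    taus = rng.uniform(0.0, 1.0, size=y_true.shape)
    return pinball_loss(y_true, y_pred, taus).mean()
```

In practice the predictions `y_pred` would come from a network that receives τ as an extra input, so minimizing the averaged loss fits all quantiles with one set of weights.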
SQR estimators model the entire conditional distribution of a target variable, and provide well-calibrated prediction intervals. To model epistemic uncertainty, we proposed Orthonormal Certificates (OCs), a collection of diverse non-trivial functions that map the training samples to zero.

Further work could be done in a number of directions. On the one hand, OCs could be used in active learning scenarios or for detecting adversarial samples. On the other hand, we would like to apply SQR to probabilistic forecasting [28] and extreme value theory [18].

We hope that our uncertainty estimates are easy to implement, and provide a solid baseline to increase the robustness and trust of deep models in sensitive domains.

References

[1] A. Abadie and J. Angrist. Instrumental variables estimation of quantile treatment effects, 1998.

[2] C. Achilles, H. P. Bain, F. Bellott, J. Boyd-Zaharias, J. Finn, J. Folger, J. Johnston, and E. Word. Tennessee's Student Teacher Achievement Ratio (STAR) project, 2008. URL https://hdl.handle.net/1902.1/10766.

[3] D. Alvarez-Melis and T. Jaakkola. A causal framework for explaining the predictions of black-box sequence-to-sequence models. In EMNLP, 2017.

[4] A. Asuncion and D. Newman. UCI machine learning repository, 2007.

[5] E. Begoli, T. Bhattacharya, and D. Kusnezov. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence, 2019.

[6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.

[7] D. Bouchacourt, P. K. Mudigonda, and S. Nowozin. DISCO Nets: Dissimilarity coefficients networks. In NeurIPS, 2016.

[8] L. Breiman. Bagging predictors. Machine Learning, 1996.

[9] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv, 2018.

[10] A. J. Cannon.
Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stochastic Environmental Research and Risk Assessment, 2018.

[11] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 2009.

[12] T. Chen, J. Navrátil, V. Iyengar, and K. Shanmugam. Confidence scoring using whitebox meta-models with linear classifier probes. arXiv, 2018.

[13] Z. Chen, K. Zhang, L. Chan, and B. Schölkopf. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 2014.

[14] C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 2016.

[15] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989.

[16] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. arXiv, 2018.

[17] P. Daniusis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. arXiv, 2012.

[18] L. De Haan and A. Ferreira. Extreme value theory: An introduction. 2007.

[19] S. Depeweg, J.-M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In ICML, 2018.

[20] A. Der Kiureghian and O. Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 2009.

[21] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC Press, 1994.

[22] T. S. Ferguson. Mathematical statistics: A decision theoretic approach. 1967.

[23] M. Fox and H. Rubin. Admissibility of quantile estimates of a single location parameter. The Annals of Mathematical Statistics, 1964.

[24] Y. Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.

[25] Y. Gal and Z.
Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

[26] Y. Gal, R. Islam, and Z. Ghahramani. Deep Bayesian active learning with image data. In ICML, 2017.

[27] Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In NIPS, 2017.

[28] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007.

[29] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv, 2016.

[30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.

[31] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.

[32] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.

[33] D. Hafner, D. Tran, A. Irpan, T. Lillicrap, and J. Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv, 2018.

[34] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[35] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR, 2017.

[36] D. Hernández-Lobato, P. Morales-Mombiela, D. Lopez-Paz, and A. Suarez. Non-linear causal inference using Gaussianity measures. JMLR, 2016.

[37] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.

[38] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.

[39] V.
Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 2004.

[40] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel. Bayesian active learning for classification and preference learning. arXiv, 2011.

[41] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In NeurIPS, 2009.

[42] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.

[43] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 2010.

[44] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, 2017.

[45] M. E. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In ICML, 2018.

[46] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.

[47] R. Koenker. Quantile Regression. Cambridge University Press, 2005.

[48] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, 1978.

[49] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.

[50] S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR, 2018.

[51] K. Liu, 2018. URL https://github.com/kuangliu/pytorch-cifar/blob/master/models/preact_resnet.py.

[52] D. Lopez-Paz. From dependence to causation. PhD thesis, University of Cambridge, 2016.

[53] N. Meinshausen. Quantile regression forests. JMLR, 2006.

[54] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv, 2014.

[55] S. Mohamed and B.
Lakshminarayanan. Learning in implicit generative models. arXiv, 2016.

[56] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: Methods and benchmarks. JMLR, 2016.

[57] D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In ICNN, 1994.

[58] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In NeurIPS, 2016.

[59] G. Ostrovski, W. Dabney, and R. Munos. Autoregressive quantile networks for generative modeling. arXiv, 2018.

[60] T. Pearce, M. Zaki, A. Brintrup, and A. Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In ICML, 2018.

[61] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 2014.

[62] Z. Qu and J. Yoon. Nonparametric estimation and inference on conditional quantile processes. Journal of Econometrics, 2015.

[63] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning. 2004.

[64] F. Rodrigues and F. C. Pereira. Beyond expectation: Deep joint mean and quantile regression for spatio-temporal problems. arXiv, 2018.

[65] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In ICML, 2018.

[66] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In NeurIPS, 2000.

[67] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 2001.

[68] B. Settles. Active learning literature survey. Computer Sciences Technical Report, 2010.

[69] A. Shafaei, M. Schmidt, and J. J. Little.
Does your model know the digit 6 is not a cat? A less biased evaluation of "outlier" detectors. arXiv, 2018.

[70] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. JMLR, 2006.

[71] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

[72] H. J. Smith, T. Dinev, and H. Xu. Information privacy research: An interdisciplinary review. MIS Quarterly, 2011.

[73] O. Stegle, D. Janzing, K. Zhang, J. M. Mooij, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In NeurIPS, 2010.

[74] I. Steinwart, A. Christmann, et al. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 2011.

[75] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, 2014.

[76] N. Tagasovska, T. Vatter, and V. Chavez-Demoulin. Nonparametric quantile-based causal discovery. arXiv, 2018.

[77] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola. Nonparametric quantile estimation. JMLR, 2006.

[78] J. W. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 2000.

[79] M. Teye, H. Azizpour, and K. Smith. Bayesian uncertainty estimation for batch normalized deep networks. ICML, 2018.

[80] D. Wang and Y. Shang. A new active labeling method for deep learning. In IJCNN, 2014.

[81] R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka. A multi-horizon quantile recurrent forecaster. arXiv, 2017.

[82] Y. Wen, B. Ellingwood, D. Veneziano, and J. Bracci. Uncertainty modeling in earthquake engineering. MAE Center Project FD-2 Report, 2003.

[83] H. White.
Nonparametric estimation of conditional quantiles using neural networks. In Computing Science and Statistics, 1992.

[84] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In UAI, 2009.

[85] W. Zhang, H. Quan, and D. Srinivasan. An improved quantile regression neural network for probabilistic load forecasting. IEEE Transactions on Smart Grid, 2018.

[86] S. Zheng. Boosting based conditional quantile estimation for regression and binary classification. In Mexican International Conference on Artificial Intelligence, 2010.

[87] L. Zhu and N. Laptev. Deep and confident prediction for time series at Uber. In ICDMW, 2017.