{"title": "Addressing Failure Prediction by Learning Model Confidence", "book": "Advances in Neural Information Processing Systems", "page_first": 2902, "page_last": 2913, "abstract": "Assessing reliably the confidence of a deep neural net and predicting its failures is of primary importance for the practical deployment of these models. In this paper, we propose a new target criterion for model confidence, corresponding to the True Class Probability (TCP). We show how using the TCP is more suited than relying on the classic Maximum Class Probability (MCP). We provide in addition theoretical guarantees for TCP in the context of failure prediction. Since the true class is by essence unknown at test time, we propose to learn TCP criterion on the training set, introducing a specific learning scheme adapted to this context. Extensive experiments are conducted for validating the relevance of the proposed approach. We study various network architectures, small and large scale datasets for image classification and semantic segmentation. We show that our approach consistently outperforms several strong methods, from MCP to Bayesian uncertainty, as well as recent approaches specifically designed for failure prediction.", "full_text": "Addressing Failure Prediction\nby Learning Model Con\ufb01dence\n\nCharles Corbi\u00e8re1,2\n\ncharles.corbiere@valeo.com\n\nNicolas Thome1\n\nnicolas.thome@cnam.fr\n\nAvner Bar-Hen1\navner@cnam.fr\n\nMatthieu Cord2,3\n\nmatthieu.cord@lip6.fr\n\nPatrick P\u00e9rez2\n\npatrick.perez@valeo.com\n\n1CEDRIC, Conservatoire National des Arts et M\u00e9tiers, Paris, France\n\n2valeo.ai, Paris, France\n\n3Sorbonne University, Paris, France\n\nAbstract\n\nAssessing reliably the con\ufb01dence of a deep neural network and predicting its fail-\nures is of primary importance for the practical deployment of these models. In this\npaper, we propose a new target criterion for model con\ufb01dence, corresponding to\nthe True Class Probability (TCP). 
We show how using the TCP is better suited than relying on the classic Maximum Class Probability (MCP). We provide in addition theoretical guarantees for TCP in the context of failure prediction. Since the true class is by essence unknown at test time, we propose to learn the TCP criterion on the training set, introducing a specific learning scheme adapted to this context. Extensive experiments are conducted to validate the relevance of the proposed approach. We study various network architectures, and small and large-scale datasets for image classification and semantic segmentation. We show that our approach consistently outperforms several strong methods, from MCP to Bayesian uncertainty, as well as recent approaches specifically designed for failure prediction.

1 Introduction

Deep neural networks have seen wide adoption, driven by their impressive performance in various tasks including image classification [25], object recognition [43, 33, 37], natural language processing [34, 35], and speech recognition [18, 15]. Despite their growing success, safety remains a great concern when it comes to deploying these models in real-world conditions [1, 19]. Estimating when a model makes an error is even more crucial in applications where failing carries serious repercussions, such as autonomous driving, medical diagnosis or nuclear power plant monitoring [32].
This paper addresses the challenge of failure prediction with deep neural networks [17, 20, 16]. The objective is to provide confidence measures for a model's predictions that are reliable and whose ranking among samples makes it possible to distinguish correct from incorrect predictions. Equipped with such a confidence measure, a system could decide to stick to the prediction or, on the contrary, to hand over to a human or a back-up system with, e.g., 
other sensors, or simply to trigger an alarm.\nIn the context of classi\ufb01cation, a widely used baseline for con\ufb01dence estimation with neural networks\nis to take the value of the predicted class\u2019 probability, namely the Maximum Class Probability (MCP),\ngiven by the softmax layer output. Although recent evaluations of MCP for failure prediction with\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: When ranking test samples according to Maximum Class Probability (a), output by a\nconvolutional model trained on CIFAR-10 dataset, we observe that correct predictions (in green) and\nincorrect ones (in red) overlap considerably, making it dif\ufb01cult to distinguish them. On the other\nhand, ranking samples according to True Class Probability (b) alleviates this issue and allows a better\nseparation for failure prediction. (Distributions of both correct and incorrect samples are plotted in\nrelative density for visualization purpose).\n\nmodern deep models reveal reasonable performances [17], they still suffer from several conceptual\ndrawbacks. Softmax probabilities are indeed known to be non-calibrated [13, 40], sensitive to\nadversarial attacks [12, 44], and inadequate for detecting in- from out-of-distribution examples [17,\n30, 26].\nAnother important issue related to MCP, which we speci\ufb01cally address in this work, relates to ranking\nof con\ufb01dence scores: this ranking is unreliable for the task of failure prediction [41, 20]. As illustrated\nin Figure 1(a) for a small convolutional network trained on CIFAR-10 dataset, MCP con\ufb01dence\nvalues for erroneous and correct predictions overlap. It is worth mentioning that this problem comes\nfrom the fact that MCP leads by design to high con\ufb01dence values, even for erroneous ones, since the\nlargest softmax output is used. 
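This by-design lower bound of MCP can be seen in a toy sketch (the probability vectors below are hypothetical, not taken from the paper's experiments): for K classes MCP is always at least 1/K, and an overconfident error is indistinguishable from a confident correct prediction, while the probability of the true class separates them:

```python
import numpy as np

def mcp(probs):
    """Maximum Class Probability: confidence of the predicted class."""
    return np.max(probs)

def tcp(probs, true_class):
    """True Class Probability: softmax probability of the ground-truth class."""
    return probs[true_class]

# Hypothetical softmax outputs over K = 3 classes.
correct = np.array([0.05, 0.90, 0.05])   # predicted class 1, true class 1
wrong   = np.array([0.90, 0.05, 0.05])   # predicted class 0, true class 1

assert mcp(correct) == 0.90 and mcp(wrong) == 0.90  # MCP cannot tell them apart
assert tcp(correct, true_class=1) == 0.90           # high for the correct case
assert tcp(wrong, true_class=1) == 0.05             # low for the error
```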
On the other hand, the probability assigned by the model to the true class naturally reflects a better-behaved model confidence, as illustrated in Figure 1(b). This leads to errors' confidence distributions shifted to smaller values, while correct predictions are still associated with high values, allowing a much better separability between these two types of prediction.
Based on this observation, we propose a novel approach for failure prediction with deep neural networks. We introduce a new confidence criterion based on the idea of using the TCP (section 2.1), for which we provide theoretical guarantees in the context of failure prediction. Since the true class is obviously unknown at test time, we introduce a method to learn a given target confidence criterion from data (section 2.2). We also discuss connections and differences with related works on failure prediction, in particular Bayesian deep learning and ensemble approaches, as well as recent approaches designing alternative criteria for failure prediction (section 2.3). We conduct extensive comparative experiments across various tasks, datasets and network architectures to validate the relevance of our proposed approach (section 3.2). Finally, a thorough analysis of our approach regarding the choice of loss function, criterion and learning scheme is presented in section 3.3.

2 Failure prediction by learning model confidence

We are interested in the problem of defining relevant confidence criteria for failure prediction with deep neural networks, in the context of classification. We also address semantic image segmentation, which can be seen as a pixel-wise classification problem, where a model outputs a dense segmentation mask with a predicted class assigned to each pixel. 
As such, all the following material is formulated for classification, and implementation details for segmentation are specified when necessary.
Let us consider a dataset D which consists of N i.i.d. training samples D = {(x_i, y_i*)}_{i=1}^N, where x_i ∈ R^d is a d-dimensional feature and y_i* ∈ Y = {1, ..., K} is its true class. We view a classification neural network as a probabilistic model: given an input x, the network assigns a probabilistic predictive distribution P(Y | w, x) by computing the softmax output for each class k, where w are the parameters of the network. From this predictive distribution, one can infer the class predicted by the model as ŷ = argmax_{k∈Y} P(Y = k | w, x).
During training, network parameters w are learned following a maximum likelihood estimation framework where one minimizes the Kullback-Leibler (KL) divergence between the predictive distribution and the true distribution. In classification, this is equivalent to minimizing the cross-entropy loss w.r.t. w, which is the negative sum of the log-probabilities over positive labels:

L_CE(w; D) = −(1/N) Σ_{i=1}^{N} log P(Y = y_i* | w, x_i).   (1)

2.1 Confidence criterion for failure prediction

Instead of trying to improve the accuracy of a given trained model, we are interested in knowing whether it can be endowed with the ability to recognize when its prediction may be wrong. A confidence criterion is a quantitative measure estimating the confidence of the model prediction. The higher the value, the more certain the model is about its prediction. As such, a suitable confidence criterion should correlate erroneous predictions with low values and successful predictions with high values. 
Here, we specifically focus on the ability of the confidence criterion to separate successful and erroneous predictions in order to distinguish them.
For a given input x, a standard approach is to compute the softmax probability of the predicted class ŷ, that is the Maximum Class Probability: MCP(x) = max_{k∈Y} P(Y = k | w, x) = P(Y = ŷ | w, x).
By taking the largest softmax probability, MCP leads to high confidence values both for errors and correct predictions, making it hard to distinguish them, as shown in Figure 1(a). On the other hand, when the model misclassifies an example, the probability associated to the true class y* is more likely to be low, reflecting the fact that the model made an error. Thus, we propose to consider the True Class Probability as a suitable confidence criterion for failure prediction:

TCP : R^d × Y → R
(x, y*) → P(Y = y* | w, x).   (2)

Theoretical guarantees. With TCP, the following properties hold (see derivation in supplementary 1.1). Given an example (x, y*),

• TCP(x, y*) > 1/2 ⇒ ŷ = y*, i.e. the example is properly classified by the model;
• TCP(x, y*) < 1/K ⇒ ŷ ≠ y*, i.e. the example is wrongly classified by the model.

Within the range [1/K, 1/2], there is no theoretical guarantee that correct and incorrect predictions will not overlap in terms of TCP. However, when using deep neural networks, we observe that the actual overlap area is extremely small in practice, as illustrated in Figure 1(b) on the CIFAR-10 dataset. One possible explanation comes from the fact that modern deep neural networks output overconfident predictions and therefore non-calibrated probabilities [13]. We provide consolidated
We provide consolidated\nresults and analysis on this aspect in Section 3 and in the supplementary 1.2.\nWe also introduce a normalized variant of the TCP con\ufb01dence criterion, which consists in computing\nthe ratio between TCP and MCP:\n\nTCPr(x, y\u2217) =\n\nP (Y = y\u2217|w, x)\nP (Y = \u02c6y|w, x)\n\n.\n\n(3)\n\nThe TCPr criterion presents stronger theoretical guarantees than TCP, since correct predictions will\nbe, by design, assigned the value of 1, whereas errors will range in [0, 1[. On the other hand, learning\nthis criterion may be more challenging since all correct predictions must match a single scalar value.\n\n2.2 Learning TCP con\ufb01dence with deep neural networks\n\nUsing TCP as con\ufb01dence criterion on a model\u2019s output would be of great help when it comes\nto predicting failures. However, the true class y\u2217 of an output is obviously not available when\n\n3\n\n\fFigure 2: Our approach is based on two sub-networks. The classi\ufb01cation model with parameters\nw is composed of a succession of convolutional and dense layers (\u2018ConvNet\u2019) followed by a \ufb01nal\ndense layer with softmax activation. The con\ufb01dence network, \u2018Con\ufb01dNet\u2019, builds upon features maps\nextracted by ConvNet, and is composed of a succession of layers which output a con\ufb01dence score\n\u02c6c(x, \u03b8) \u2208 [0, 1].\n\nestimating con\ufb01dence on test samples. Thus, we propose to learn TCP con\ufb01dence c\u2217(x, y\u2217) =\nP (Y = y\u2217|w, x) 1, our target con\ufb01dence value. We introduce a con\ufb01dence neural network, termed\nCon\ufb01dNet, with parameters \u03b8, which outputs a con\ufb01dence prediction \u02c6c(x, \u03b8). During training, we\nseek \u03b8 such that \u02c6c(x, \u03b8) is close to c\u2217(x, y\u2217) on training samples (see Figure 2).\nCon\ufb01dNet builds upon a classi\ufb01cation neural network M, whose parameters w are preliminary\nlearned using cross-entropy loss LCE in (1). 
We are not concerned with improving model M's accuracy. As a consequence, its classification layers (last fully connected layer and subsequent operations) will be fixed from now on.

Confidence network design. During initial classification training, model M learns to extract increasingly complex features that are fed to the classification layers. To benefit from these rich representations, we build ConfidNet on top of them: ConfidNet passes these features through a succession of dense layers with a final sigmoid activation that outputs a scalar ĉ(x, θ) ∈ [0, 1]. Note that in semantic segmentation, models consist of fully convolutional networks where hidden representations are 2D feature maps. ConfidNet can benefit from this spatial information by replacing dense layers with 1 × 1 convolutions with an adequate number of channels.

Loss function. Since we want to regress a score between 0 and 1, we use the ℓ2 loss to train ConfidNet:

L_conf(θ; D) = (1/N) Σ_{i=1}^{N} (ĉ(x_i, θ) − c*(x_i, y_i*))².   (4)

In the experimental part, we also tried more direct approaches for failure prediction, such as a binary cross-entropy loss (BCE) between the confidence network score and an incorrect/correct prediction target. We also tried the Focal loss [31], a BCE variant which focuses on hard examples. Finally, one can also see failure detection as a ranking problem where good predictions must be ranked before erroneous ones according to a confidence criterion. To this end, we also implemented a ranking loss [36, 7] applied locally on training batch inputs.

Learning scheme. Our complete confidence model, from input image to confidence score, shares its first encoding part ('ConvNet' in Fig. 2) with the classification model M. The training of ConfidNet
The training of Con\ufb01dNet\n\n1or its normalized variant TCPr(x, y\u2217).\n\n4\n\n\fstarts by \ufb01xing entirely M (freezing w) and learning \u03b8 using loss (4). In a next step, we can then\n\ufb01ne-tune the ConvNet encoder. However, as model M has to remain \ufb01xed to compute similar\nclassi\ufb01cation predictions, we have now to decouple the feature encoders used for classi\ufb01cation and\ncon\ufb01dence prediction respectively. We also deactivate dropout layers in this last training phase and\nreduce learning rate to mitigate stochastic effects that may lead the new encoder to deviate too much\nfrom the original one used for classi\ufb01cation. Data augmentation can thus still be used.\n\n2.3 Related works\n\nCon\ufb01dence estimation has already raised interest in the machine learning community over the past\ndecade. Blatz et al. [3] introduce a method similar to our BCE baseline for con\ufb01dence estimation in\nmachine translation but their approach is not dedicated to training deep neural networks. Similarly,\n[42, 29] mention the use of bi-directional lattice RNN speci\ufb01cally designed for con\ufb01dence estimation\nin speech recognition, whereas Con\ufb01dNet offers a model- and task-agnostic approach which can\nbe plugged into any deep neural network. Post-hoc selective classi\ufb01cation methods [11] identify a\nthreshold over a con\ufb01dence-rate function (e.g., MCP) to satisfy a user-speci\ufb01ed risk level, whereas\nwe focus here on relative metrics. Recently, Hendricks et al. [17] established a standard baseline for\ndeep neural networks which relies on MCP retrieved from softmax distribution. As stated before,\nMCP presents several limits regarding both failure prediction and out-of-distribution detection as\nit outputs high con\ufb01dence values. This limit is alleviated in our TCP criterion which also provides\nsome interesting theoretical guarantees regarding con\ufb01dence threshold.\nIn [20], Jiang et al. 
propose a new con\ufb01dence measure, \u2018Trust Score\u2019, which measures the agreement\nbetween the classi\ufb01er and a modi\ufb01ed nearest-neighbor classi\ufb01er on the test examples. More precisely,\nthe con\ufb01dence criterion used in Trust Score [20] is the ratio between the distance from the sample\nto the nearest class different from the predicted class and the distance to the predicted class. One\nclear drawback of this approach is its lack of scalability, since computing nearest neighbors in large\ndatasets is extremely costly in both computation and memory. Another more fundamental limitation\nrelated to the Trust Score itself is that local distance computation becomes less meaningful in high\ndimensional spaces [2], which is likely to negatively affect performances of this method. In contrast,\nCon\ufb01dNet is based on a training approach which learns a sub-manifold in the error/success space,\nwhich is arguably less prone to the curse of dimensionality and, therefore, facilitate discrimination\nbetween these classes.\nBayesian approaches for uncertainty estimation in neural networks gained a lot of attention recently,\nespecially due to the elegant connection between ef\ufb01cient stochastic regularization techniques,\ne.g. dropout [10], and variational inference in Bayesian neural networks [10, 9, 4, 21, 22]. Gal and\nGhahramani proposed in [10] using Monte Carlo Dropout (MCDropout) to estimate the posterior\npredictive network distribution by sampling several stochastic network predictions. When applied\nto regression, the predictive distribution uncertainty can be summarized by computing statistics,\ne.g. variance. When using MCDropout for uncertainty estimation in classi\ufb01cation tasks, however,\nthe predictive distribution is averaged to a point-wise softmax estimate before computing standard\nuncertainty criteria, e.g. entropy or variants such as mutual information. 
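A schematic NumPy version of this procedure (a hypothetical one-layer toy model with dropout on the inputs; real MCDropout keeps dropout active throughout a deep network at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mc_dropout_entropy(x, W, p=0.5, T=100):
    """Monte Carlo Dropout: average T stochastic softmax predictions,
    then summarize uncertainty by the entropy of the mean distribution."""
    preds = []
    for _ in range(T):
        mask = rng.random(x.shape) > p          # dropout mask on the input features
        preds.append(softmax(W @ (x * mask) / (1 - p)))
    mean_probs = np.mean(preds, axis=0)
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return mean_probs, entropy

x = rng.normal(size=8)           # toy input
W = rng.normal(size=(3, 8))      # toy linear "network" with 3 classes
mean_probs, H = mc_dropout_entropy(x, W)
# H is low for a peaked mean distribution and approaches log(3) for a uniform one.
```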
It is worth mentioning that\nthese entropy-based criteria measure the softmax output dispersion, where the uniform distribution has\nmaximum entropy. It is not clear how well these dispersion measures are adapted for distinguishing\nfailures from correct predictions, especially with deep neural networks which output overcon\ufb01dent\npredictions [13]: for example, it might be very challenging to discriminate a peaky prediction\ncorresponding to a correct prediction from an incorrect overcon\ufb01dent one. We illustrate this issue in\nsection 3.2.\nIn tasks closely related to failure prediction, other approaches also identi\ufb01ed the issue of MCP\nregarding high con\ufb01dence predictions [17, 30, 26, 28, 13, 40]. Guo et al. [13], for con\ufb01dence\ncalibration, and Liang et al. [30], for out-of-distribution detection, proposed to use temperature\nscaling to mitigate con\ufb01dence values. However, this doesn\u2019t affect the ranking of the con\ufb01dence\nscore and therefore the separability between errors and correct predictions. DeVries et al. [6] share\nwith us the same purpose of learning con\ufb01dence in neural networks. Their work differs by focusing\non out-of-distribution detection and learning jointly a distribution con\ufb01dence score and classi\ufb01cation\nprobabilities. In addition, they use predicted con\ufb01dence score to interpolate output probabilities and\ntarget whereas we speci\ufb01cally de\ufb01ne TCP, a criterion suited for failure prediction.\n\n5\n\n\fLakshminarayanan et al. [26] propose an alternative to Bayesian neural networks by leveraging\nensemble of neural networks to produce well-calibrated uncertainty estimates. Part of their approach\nrelies on using a proper scoring rule as training criterion. 
It is interesting to note that our TCP criterion actually corresponds to the exponential of the negative cross-entropy loss of a model prediction, which is a proper scoring rule in the case of multi-class classification.

3 Experiments

In this section, we evaluate our approach to failure prediction in both classification and segmentation settings. First, we run comparative experiments against state-of-the-art confidence estimation and Bayesian uncertainty estimation methods on various datasets. These results are then completed by a thorough analysis of the influence of the confidence criterion, the training loss and the learning scheme in our approach. Finally, we provide a few visualizations to get additional insight into the behavior of our approach. Our code is available at https://github.com/valeoai/ConfidNet.

3.1 Experimental setup

Datasets. We run experiments on image datasets of varying scale and complexity: the MNIST [27] and SVHN [39] datasets provide relatively simple and small images of digits (10 classes). CIFAR-10 and CIFAR-100 [24] propose more complex object recognition tasks on low-resolution images. We also report experiments for semantic segmentation on CamVid [5], a standard road scene dataset. Further details about these datasets, as well as about architectures, training and metrics, can be found in supplementary 2.1.
Network architectures. The classification deep architectures follow those proposed in [20] for fair comparison. They range from small convolutional networks for MNIST and SVHN to the larger VGG-16 architecture for the CIFAR datasets. We also added a multi-layer perceptron (MLP) with 1 hidden layer for MNIST to investigate performance on small models. For CamVid, we implemented a SegNet semantic segmentation model, following [21].
Our confidence prediction network, ConfidNet, is attached to the penultimate layer of the classification network. 
It is composed of a succession of 5 dense layers. Variants of this architecture have been\ntested, leading to similar performances (see supplementary 2.2 for more details). Following our\nspeci\ufb01c learning scheme, we \ufb01rst train Con\ufb01dNet layers before \ufb01ne-tuning the duplicate ConvNet\nencoder dedicated to con\ufb01dence estimation. In the context of semantic segmentation, we adapt\nCon\ufb01dNet by making it fully convolutional.\nEvaluation metrics. We measure the quality of failure prediction following the standard metrics\nused in the literature [17]: AUPR-Error, AUPR-Success, FPR at 95% TPR and AUROC. We will\nmainly focus on AUPR-Error, which computes the area under the Precision-Recall curve using errors\nas the positive class.\n\n3.2 Comparative results on failure prediction\n\nTo demonstrate the effectiveness of our method, we implemented competitive con\ufb01dence and un-\ncertainty estimation approaches including Maximum Class Probability (MCP) as a baseline [17],\nTrust Score [20], and Monte-Carlo Dropout (MCDropout) [10]. For Trust Score, we used the code\nprovided by the authors2. Further implementation details and parameter settings are available in the\nsupplementary 2.1.\nComparative results are summarized in Table 1. First of all, we observe that our approach outperforms\nbaseline methods in every setting, with a signi\ufb01cant gap on small models/datasets. This con\ufb01rms both\nthat TCP is an adequate con\ufb01dence criterion for failure prediction and that our approach Con\ufb01dNet\nis able to learn it. TrustScore method also presents good results on small datasets/models such as\nMNIST where it improved baseline. While Con\ufb01dNet still performs well on more complex datasets,\nTrust Score\u2019s performance drops, which might be explained by high dimensionality issues with\ndistances as mentioned in section 2.3. 
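The AUPR-Error metric used throughout this comparison can be sketched in a few lines (a minimal NumPy average-precision computation; the function name, toy scores and error labels are illustrative, and ties between confidence values are not handled as a full implementation such as scikit-learn's would):

```python
import numpy as np

def aupr_error(confidence, is_error):
    """Average precision with errors as the positive class: samples are
    ranked by increasing confidence (least confident flagged first)."""
    order = np.argsort(confidence)                 # ascending confidence
    err = np.asarray(is_error, dtype=float)[order]
    tp = np.cumsum(err)                            # errors retrieved so far
    precision = tp / np.arange(1, len(err) + 1)
    return np.sum(precision * err) / err.sum()

# Hypothetical scores: both errors receive the lowest confidence values.
conf = np.array([0.95, 0.20, 0.90, 0.10])
errs = np.array([0, 1, 0, 1])
aupr_error(conf, errs)    # perfect ranking of errors -> 1.0
```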
For its application to semantic segmentation, where each training pixel is a 'neighbor', computational complexity forced us to drastically reduce the number of training neighbors and of test samples. We sampled randomly in each train and test image a small percentage of pixels to compute TrustScore². ConfidNet, in contrast, is as fast as the original segmentation network.

²https://github.com/google/TrustScore

Table 1: Comparison of failure prediction methods on various datasets. All methods share the same classification network. Note that for MCDropout, test accuracy is averaged over random sampling. All values are percentages.

Dataset / Model             FPR-95%-TPR  AUPR-Error  AUPR-Success  AUC
MNIST MLP
  Baseline (MCP) [17]       14.87        37.70       99.94         97.13
  MCDropout [10]            15.15        38.22       99.94         97.15
  TrustScore [20]           12.31        52.18       99.95         97.52
  ConfidNet (Ours)          11.79        57.37       99.95         97.83
MNIST Small ConvNet
  Baseline (MCP) [17]       5.56         35.05       99.99         98.63
  MCDropout [10]            5.26         38.50       99.99         98.65
  TrustScore [20]           10.00        35.88       99.98         98.20
  ConfidNet (Ours)          3.33         45.89       99.99         98.82
SVHN Small ConvNet
  Baseline (MCP) [17]       31.28        48.18       99.54         93.20
  MCDropout [10]            36.60        43.87       99.52         92.85
  TrustScore [20]           34.74        43.32       99.48         92.16
  ConfidNet (Ours)          28.58        50.72       99.55         93.44
CIFAR-10 VGG16
  Baseline (MCP) [17]       47.50        45.36       99.19         91.53
  MCDropout [10]            49.02        46.40       99.27         92.08
  TrustScore [20]           55.70        38.10       98.76         88.47
  ConfidNet (Ours)          44.94        49.94       99.24         92.12
CIFAR-100 VGG16
  Baseline (MCP) [17]       67.86        71.99       92.49         85.67
  MCDropout [10]            64.68        72.59       92.96         86.09
  TrustScore [20]           71.74        66.82       91.58         84.17
  ConfidNet (Ours)          62.96        73.68       92.68         86.28
CamVid SegNet
  Baseline (MCP) [17]       63.87        48.53       96.37         84.42
  MCDropout [10]            62.95        49.35       96.40         84.58
  TrustScore [20]           —            20.42       92.72         68.33
  ConfidNet (Ours)          61.52        50.51       96.58         85.02

We also improve on the state-of-the-art performance of MCDropout. While MCDropout leverages ensembling based on dropout layers, taking as confidence measure the entropy of the average softmax distribution may not always be adequate. In Figure 4, we show side by side two samples with a similar distribution entropy. The left image is misclassified while the right one enjoys a correct prediction. Entropy is a symmetric measure with regard to class probabilities: a correct prediction with a [0.65, 0.35] distribution is evaluated as confident as an incorrect one with a [0.35, 0.65] distribution. In contrast, our approach can discriminate an incorrect from a correct prediction despite both having similarly spread distributions.

Figure 4: Illustrating the limits of MCDropout with entropy as confidence estimation on SVHN test samples. The red-border image (a) is misclassified by the classification model; the green-border image (b) is correctly classified. Predictions exhibit similar high entropy in both cases. For each sample, we provide a plot of its softmax predictive distribution.

Figure 3: Risk-coverage curves on (a) CIFAR-10 and (b) SVHN. 'Selective risk' (y-axis) represents the percentage of errors in the remaining test set for a given coverage percentage.

Risk-coverage curves [8, 11] depicting the performance of ConfidNet and other baselines for the CIFAR-10 and SVHN datasets appear in Figure 3. 'Coverage' corresponds to the probability mass of the non-rejected region after using a threshold as selection function [11]. For both datasets, ConfidNet presents a better coverage potential for each selective risk that a user can choose beforehand. In addition, we can see that the improvement is more pronounced at high coverage rates - e.g. in [0.8; 0.95] for CIFAR-10 (Fig. 3a) and in [0.86; 0.96] for SVHN (Fig. 
3b), which highlights the capacity of ConfidNet to successfully identify critical failures.

3.3 Effect of learning variants

We first evaluate the effect of fine-tuning ConvNet in our approach. Without fine-tuning, ConfidNet already achieves significant improvements w.r.t. the baseline, as shown in Table 2. By allowing subsequent fine-tuning as described in section 2.2, ConfidNet performance is further boosted in every setting, by around 1-2%. Note that a vanilla fine-tuning without deactivating dropout layers did not bring any improvement.

Table 2: Effect of learning scheme on AUPR-Error

                          MNIST SmallConvNet   CIFAR-100 VGG-16
Confidence training       43.94%               72.68%
+ Fine-tuning ConvNet     45.89%               73.68%

Given the small number of errors available due to deep neural network over-fitting, we also experimented with training ConfidNet on a hold-out dataset. We report results on all datasets in Table 3 for validation sets with 10% of samples. We observe a general performance drop when using a validation set for training TCP confidence. The drop is especially pronounced for small datasets (MNIST), where models reach >97% train and val accuracies. Consequently, with a high accuracy and a small validation set, we do not get a larger absolute number of errors using the val set compared to the train set. One solution would be to increase the validation set size, but this would damage the model's prediction performance. By contrast, we take care with our approach to base our confidence estimation on models with levels of test predictive performance that are similar to those of baselines. On CIFAR-100, the gap between train accuracy and val accuracy is substantial (95.56% vs. 65.96%), which may explain the slight improvement for confidence estimation using the val set (+0.17%). We think that training ConfidNet on the val set with models reporting low/middle test accuracies could improve the approach.

Table 3: Comparison between training ConfidNet on train set or on validation set

AUPR-Error (%)               MNIST    MNIST          SVHN           CIFAR-10  CIFAR-100  CamVid
                             MLP      SmallConvNet   SmallConvNet   VGG-16    VGG-16     SegNet
ConfidNet (using train set)  57.34%   43.94%         50.72%         49.94%    73.68%     50.28%
ConfidNet (using val set)    33.41%   34.22%         47.96%         48.93%    73.85%     50.15%

In Table 4, we compare training ConfidNet with the MSE loss to a binary classification cross-entropy loss (BCE). Even though BCE specifically addresses the failure prediction task, we observe that it achieves lower performances on the CIFAR-10 and CamVid datasets. Focal loss and ranking loss were also tested and presented similar results (see supplementary 2.3).

Table 4: Effect of loss and normalized criterion on AUPR-Error

            Loss / Criterion
Dataset     TCP       TCPr      BCE
CIFAR-10    49.94%    47.95%    48.78%
CamVid      50.51%    51.35%    48.96%

We intuitively think that TCP regularizes training by providing more fine-grained information about the quality of the classifier regarding a sample's prediction. This is especially important in the difficult learning configuration where only very few error samples are available due to the good performance of the classifier. We also evaluate the impact of regressing the normalized criterion TCPr: performance is lower than that of TCP on small datasets such as CIFAR-10, where few errors are present, but higher on larger datasets such as CamVid, where each pixel is a sample. This emphasizes once again the complexity of incorrect/correct classification training.

3.4 Qualitative assessments

In this last subsection, we provide an illustration on CamVid (Figure 5) to better understand our approach for failure prediction. 
Compared to the MCP baseline, our approach produces higher confidence scores for correct pixel predictions and lower ones on erroneously predicted pixels, which allows a user to better detect error areas in semantic segmentation.

Figure 5: Comparison of the inverse confidence (uncertainty) maps of ConfidNet (e) and MCP (f) on one CamVid scene. The top row shows the input image (a) with its ground truth (b) and the semantic segmentation mask (c) predicted by the original classification model. The error map associated with the predicted segmentation is shown in (d), with erroneous predictions flagged in white. ConfidNet (55.53% AP-Error) allows a better prediction of these errors than MCP (54.69% AP-Error).

4 Conclusion

In this paper, we defined a new confidence criterion, TCP, which provides both theoretical guarantees and empirical evidence for addressing failure prediction. We proposed a specific method to learn this criterion with a confidence neural network built upon a classification model. Results showed significant improvements over strong baselines on various classification and semantic segmentation datasets, which validates the effectiveness of our approach. Future work involves exploring methods to artificially generate errors, such as in adversarial training. ConfidNet could also be applied to uncertainty estimation in domain adaptation [45, 14] or in multi-task learning [23, 38].

References

[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[2] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, 1999.

[3] John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing.
Confidence estimation for machine translation. In COLING, 2004.

[4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In ICML, 2015.

[5] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recogn. Lett., 30(2):88-97, 2009.

[6] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.

[7] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Mantra: Minimum maximum latent structural SVM for image classification and ranking. In ICCV, 2015.

[8] Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. J. Mach. Learn. Res., 11:1605-1641, 2010.

[9] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[10] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

[11] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In NIPS, 2017.

[12] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.

[14] Ligong Han, Yang Zou, Ruijiang Gao, Lezi Wang, and Dimitris Metaxas. Unsupervised domain adaptation via calibrating uncertainties. In CVPR Workshops, 2019.

[15] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
[16] Simon Hecker, Dengxin Dai, and Luc Van Gool. Failure prediction for autonomous driving. In IV, 2018.

[17] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.

[18] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

[19] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519, 2017.

[20] Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. To trust or not to trust a classifier. In NIPS, 2018.

[21] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.

[22] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.

[23] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.

[24] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[26] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles.
In NIPS, 2017.

[27] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist, 1998.

[28] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018.

[29] Qiujia Li, Preben Ness, Anton Ragni, and M.J.F. Gales. Bi-directional lattice recurrent neural networks for confidence estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.

[30] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.

[31] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.

[32] Ondrej Linda, Todd Vollmer, and Milos Manic. Neural network based intrusion detection system for critical infrastructures. In IJCNN, 2009.

[33] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[34] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[35] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.

[36] Pritish Mohapatra, Michal Rolínek, C.V. Jawahar, Vladimir Kolmogorov, and M. Pawan Kumar. Efficient optimization for rank-based loss functions. In CVPR, 2018.

[37] Taylor Mordan, Nicolas Thome, Gilles Henaff, and Matthieu Cord.
End-to-end learning of latent deformable part-based representations for object detection. International Journal of Computer Vision, pages 1-21, 2018.

[38] Taylor Mordan, Nicolas Thome, Gilles Henaff, and Matthieu Cord. Revisiting multi-task learning with ROCK: a deep residual auxiliary block for visual detection. In NIPS, 2018.

[39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.

[40] L. Neumann, A. Zisserman, and A. Vedaldi. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshops, 2018.

[41] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015.

[42] A. Ragni, Q. Li, M. J. F. Gales, and Y. Wang. Confidence estimation and deletion prediction using bidirectional recurrent neural networks. In SLT Workshop, 2018.

[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[44] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[45] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, 2019.