{"title": "A Probabilistic U-Net for Segmentation of Ambiguous Images", "book": "Advances in Neural Information Processing Systems", "page_first": 6965, "page_last": 6975, "abstract": "Many real-world vision problems suffer from inherent ambiguities. In clinical applications for example, it might not be clear from a CT scan alone which particular region is cancer tissue. Therefore a group of graders typically produces a set of diverse but plausible segmentations. We consider the task of learning a distribution over segmentations given an input. To this end we propose a generative segmentation model based on a combination of a U-Net with a conditional variational autoencoder that is capable of efficiently producing an unlimited number of plausible hypotheses. We show on a lung abnormalities segmentation task and on a Cityscapes segmentation task that our model reproduces the possible segmentation variants as well as the frequencies with which they occur, doing so significantly better than published approaches. These models could have a high impact in real-world applications, such as being used as clinical decision-making algorithms accounting for multiple plausible semantic segmentation hypotheses to provide possible diagnoses and recommend further actions to resolve the present ambiguities.", "full_text": "A Probabilistic U-Net for Segmentation of Ambiguous\n\nImages\n\nSimon A. A. Kohl1\u2217,2,, Bernardino Romera-Paredes1, Clemens Meyer1, Jeffrey De Fauw1,\nJoseph R. Ledsam1, Klaus H. Maier-Hein2, S. M. Ali Eslami1, Danilo Jimenez Rezende1, and\n\nOlaf Ronneberger1\n\n2Division of Medical Image Computing, German Cancer Research Center, Heidelberg, Germany\n\n1DeepMind, London, UK\n\n{brp,meyerc,defauw,jledsam,aeslami,danilor,olafr}@google.com\n\n{simon.kohl,k.maier-hein}@dkfz.de\n\nAbstract\n\nMany real-world vision problems suffer from inherent ambiguities. 
In clinical\napplications for example, it might not be clear from a CT scan alone which par-\nticular region is cancer tissue. Therefore a group of graders typically produces\na set of diverse but plausible segmentations. We consider the task of learning a\ndistribution over segmentations given an input. To this end we propose a generative\nsegmentation model based on a combination of a U-Net with a conditional vari-\national autoencoder that is capable of ef\ufb01ciently producing an unlimited number\nof plausible hypotheses. We show on a lung abnormalities segmentation task\nand on a Cityscapes segmentation task that our model reproduces the possible\nsegmentation variants as well as the frequencies with which they occur, doing so\nsigni\ufb01cantly better than published approaches. These models could have a high\nimpact in real-world applications, such as being used as clinical decision-making\nalgorithms accounting for multiple plausible semantic segmentation hypotheses to\nprovide possible diagnoses and recommend further actions to resolve the present\nambiguities.\n\n1\n\nIntroduction\n\nThe semantic segmentation task assigns a class label to each pixel in an image. While in many\ncases the context in the image provides suf\ufb01cient information to resolve the ambiguities in this\nmapping, there exists an important class of images where even the full image context is not suf\ufb01cient\nto resolve all ambiguities. Such ambiguities are common in medical imaging applications, e.g.,\nin lung abnormalities segmentation from CT images. A lesion might be clearly visible, but the\ninformation about whether it is cancer tissue or not might not be available from this image alone.\nSimilar ambiguities are also present in photos. E.g. a part of fur visible under the sofa might belong\nto a cat or a dog, but it is not possible from the image alone to resolve this ambiguity2. 
Most existing\nsegmentation algorithms either provide only one likely consistent hypothesis (e.g., \u201call pixels belong\nto a cat\u201d) or a pixel-wise probability (e.g., \u201ceach pixel is 50% cat and 50% dog\u201d).\nEspecially in medical applications where a subsequent diagnosis or a treatment depends on the seg-\nmentation map, an algorithm that only provides the most likely hypothesis might lead to misdiagnoses\n\n\u2217work done during an internship at DeepMind.\n2In [1] this is de\ufb01ned as ambiguous evidence in contrast to implicit class confusion, that stems from an\nambiguous class de\ufb01nition (e.g. the concepts of desk vs. table). For the presented work this differentiation is not\nrequired.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: The Probabilistic U-Net. (a) Sampling process. Arrows: \ufb02ow of operations; blue blocks:\nfeature maps. The heatmap represents the probability distribution in the low-dimensional\nlatent space RN (e.g., N = 6 in our experiments). For each execution of the network,\none sample z \u2208 RN is drawn to predict one segmentation mask. Green block: N-channel\nfeature map from broadcasting sample z. The number of feature map blocks shown is\nreduced for clarity of presentation. (b) Training process illustrated for one training example.\nGreen arrows: loss functions.\n\nand sub-optimal treatment. Providing only pixel-wise probabilities ignores all co-variances between\nthe pixels, which makes a subsequent analysis much more dif\ufb01cult if not impossible. 
If multiple\nconsistent hypotheses are provided, these can be directly propagated into the next step in a diagnosis\npipeline, they can be used to suggest further diagnostic tests to resolve the ambiguities, or an expert\nwith access to additional information can select the appropriate one(s) for the subsequent steps.\nHere we present a segmentation framework that provides multiple segmentation hypotheses for\nambiguous images (Fig. 1a). Our framework combines a conditional variational auto encoder (CVAE)\n[2, 3, 4, 5] which can model complex distributions, with a U-Net [6] which delivers state-of-the-art\nsegmentations in many medical application domains. A low-dimensional latent space encodes the\npossible segmentation variants. A random sample from this space is injected into the U-Net to\nproduce the corresponding segmentation map. One key feature of this architecture is the ability to\nmodel the joint probability of all pixels in the segmentation map. This results in multiple segmentation\nmaps, where each of them provides a consistent interpretation of the whole image. Furthermore our\nframework is able to also learn hypotheses that have a low probability and to predict them with the\ncorresponding frequency. We demonstrate these features on a lung abnormalities segmentation task,\nwhere each lesion has been segmented independently by four experts, and on the Cityscapes dataset,\nwhere we arti\ufb01cially \ufb02ip labels with a certain frequency during training.\nA body of work with different approaches towards probabilistic and multi-modal segmentation exists.\nThe most common approaches provide independent pixel-wise probabilities [7, 8]. These models\ninduce a probability distribution by using dropout over spatial features. 
Whereas this strategy fulfills this line of work's objective of quantifying the pixel-wise uncertainty, it produces inconsistent outputs. A simple way to produce plausible hypotheses is to learn an ensemble of (deep) models [9]. While the outputs produced by ensembles are consistent, they are not necessarily diverse and ensembles are typically not able to learn the rare variants as their members are trained independently. In order to overcome this, several approaches train models jointly using the oracle set loss [10], i.e. a loss that only accounts for the closest prediction to the ground truth. This has been explored in [11] and [1] using an ensemble of deep networks, and in [12] and [13] using one common deep network with M heads. While multi-head approaches may have the capacity to capture a diverse set of variants, they are not equipped to learn the occurrence frequencies of individual variants. Two common disadvantages of both ensembles and M-heads models are their ungraceful scaling to large numbers of hypotheses, and their requirement of fixing the number of allowed hypotheses at training time.

Another set of approaches to produce multiple diverse solutions relies on graphical models, such as junction chains [14], and more generally Markov Random Fields [15, 16, 17, 18]. While many of the previous approaches are guaranteed to find the best diverse solutions, these are confined to structured problems whose dependencies can be described by tractable graphical models.

The task of image-to-image translation [19] tackles a very similar problem: an under-constrained domain transfer of images needs to be learned. 
Many of the recent approaches employ generative\nadversarial networks (GANs) which are known to suffer from challenges such as \u2018mode-collapse\u2019 [20].\nIn an attempt to solve the mode-collapse problem, the \u2018bicycleGAN\u2019 [21] involves a component that\nis similar in architecture to ours. In contrast to our proposed architecture, their model encompasses\na \ufb01xed prior distribution and during training their posterior distribution is only conditioned on the\noutput image. Very recent work on generating appearances given a shape encoding [22] also combines\na U-Net with a VAE, and was developed concurrently to ours. In contrast to our proposal, their\ntraining requires an additional pretrained VGG-net that is employed as a reconstruction loss. Finally,\nin [23] is proposed a probabilistic model for structured outputs based on optimizing the dissimilarity\ncoef\ufb01cient [24] between the ground truth and predicted distributions. The resultant approach is\nassessed on the task of hand pose estimation, that is, predicting the location of 14 joints, arguably a\nsimpler space compared to the space of segmentations we consider here. Similarly to the approach\npresented below, they inject latent variables at a later stage of the network architecture.\nThe main contributions of this work are: (1) Our framework provides consistent segmentation maps\ninstead of pixel-wise probabilities and can therefore give a joint likelihood of modes. (2) Our model\ncan induce arbitrarily complex output distributions including the occurrence of very rare modes,\nand is able to learn calibrated probabilities of segmentation modes. (3) Sampling from our model\nis computationally cheap. 
(4) In contrast to many existing applications of deep generative models that can only be qualitatively evaluated, our application and datasets allow quantitative performance evaluation including penalization of missing modes.

2 Network Architecture and Training Procedure

Our proposed network architecture is a combination of a conditional variational auto-encoder [2, 3, 4, 5] with a U-Net [6], with the objective of learning a conditional density model over segmentations, conditioned on the image.

Sampling. The central component of our architecture (Fig. 1a) is a low-dimensional latent space R^N (e.g., N = 6, which performed best in our experiments). Each position in this space encodes a segmentation variant. The 'prior net', parametrized by weights ω, estimates the probability of these variants for a given input image X. This prior probability distribution (called P in the following) is modelled as an axis-aligned Gaussian with mean μ_prior(X; ω) ∈ R^N and variance σ_prior(X; ω) ∈ R^N. To predict a set of m segmentations we apply the network m times to the same input image (only a small part of the network needs to be re-evaluated in each iteration, see below). In each iteration i ∈ {1, ..., m}, we draw a random sample z_i ∈ R^N from P,

z_i ∼ P(·|X) = N(μ_prior(X; ω), diag(σ_prior(X; ω))),    (1)

broadcast the sample to an N-channel feature map with the same shape as the segmentation map, and concatenate this feature map to the last activation map of a U-Net (the U-Net is parameterized by weights θ). A function f_comb. composed of three subsequent 1 × 1 convolutions (ψ being the set of their weights) combines the information and maps it to the desired number of classes. 
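The sampling path can be illustrated with a small numpy sketch (all shapes, names and the random stand-ins for the networks below are illustrative, not the authors' implementation): the U-Net features and the prior parameters are computed once per image, while each new sample z only re-runs the cheap combination step.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 6                 # latent dimensions, as in the paper's experiments
H, W, C = 32, 32, 8   # spatial size / channels of the U-Net's last feature map (illustrative)
K = 2                 # number of output classes (illustrative)

def sample_prior(mu, sigma):
    """Draw z ~ N(mu, diag(sigma)) via the reparameterization z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def inject_and_combine(unet_features, z, w1x1):
    """Broadcast z to an N-channel map, concatenate, and apply a 1x1 'convolution'.

    A 1x1 convolution is a per-pixel linear map, so a matmul over the channel
    axis suffices for this sketch (the real f_comb uses three of them).
    """
    z_map = np.broadcast_to(z, (H, W, N))                  # green block in Fig. 1a
    combined = np.concatenate([unet_features, z_map], -1)  # (H, W, C + N)
    return combined @ w1x1                                 # (H, W, K) logits

# Hypothetical prior-net outputs and U-Net features for one image X.
mu_prior = rng.standard_normal(N)
sigma_prior = np.exp(rng.standard_normal(N))
unet_features = rng.standard_normal((H, W, C))             # computed once per image
w1x1 = rng.standard_normal((C + N, K))

# m samples reuse the U-Net features; only the tail after the injection re-runs.
segmentations = []
for _ in range(3):
    z = sample_prior(mu_prior, sigma_prior)
    logits = inject_and_combine(unet_features, z, w1x1)
    segmentations.append(logits.argmax(-1))                # one consistent mask per z
```

Each draw of z yields one full, internally consistent segmentation map rather than independent per-pixel probabilities.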
The output, S_i, is the segmentation map corresponding to point z_i in the latent space:

S_i = f_comb.(f_U-Net(X; θ), z_i; ψ).    (2)

Notice that when drawing m samples for the same input image, we can reuse the output of the prior net and the feature activations of the U-Net. Only the function f_comb. needs to be re-evaluated m times.

Training. The networks are trained with the standard training procedure for conditional VAEs (Fig. 1b), i.e. by minimizing the variational lower bound (Eq. 4). The main difference with respect to training a deterministic segmentation model is that the training process additionally needs to find a useful embedding of the segmentation variants in the latent space. This is solved by introducing a 'posterior net', parametrized by weights ν, that learns to recognize a segmentation variant (given the raw image X and the ground truth segmentation Y) and to map this to a position μ_post(X, Y; ν) ∈ R^N with some uncertainty σ_post(X, Y; ν) ∈ R^N in the latent space. The output is denoted as posterior distribution Q. A sample z from this distribution,

z ∼ Q(·|X, Y) = N(μ_post(X, Y; ν), diag(σ_post(X, Y; ν))),    (3)

combined with the activation map of the U-Net (Eq. 2) must result in a predicted segmentation S identical to the ground truth segmentation Y provided in the training example. A cross-entropy loss penalizes differences between S and Y (the cross-entropy loss arises from treating the output S as the parameterization of a pixel-wise categorical distribution P_c). Additionally there is a Kullback-Leibler divergence D_KL(Q||P) = E_{z∼Q}[log Q − log P] which penalizes differences between the posterior distribution Q and the prior distribution P. Both losses are combined as a weighted sum with a weighting factor β, as done in [25]:

L(Y, X) = E_{z∼Q(·|Y,X)}[−log P_c(Y|S(X, z))] + β · D_KL(Q(z|Y, X)||P(z|X)).    (4)

The training is done from scratch with randomly initialized weights. During training, this KL loss "pulls" the posterior distribution (which encodes a segmentation variant) and the prior distribution towards each other. On average (over multiple training examples) the prior distribution will be modified in a way such that it "covers" the space of all presented segmentation variants for a specific input image3.

3 Performance Measures and Baseline Methods

In this section we first present the metric used to assess the performance of all approaches, and then describe each competitor approach used in the comparisons.

3.1 Performance measures

As is common in the semantic segmentation literature, we employ the intersection over union (IoU) as a measure to compare a pair of segmentations. However, in the present case, we not only want to compare a deterministic prediction with a unique ground truth, but rather we are interested in comparing distributions of segmentations. To do so, we use the generalized energy distance [26, 27, 28], which leverages distances between observations:

D²_GED(P_gt, P_out) = 2 E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)],    (5)

where d is a distance measure, Y and Y′ are independent samples from the ground truth distribution P_gt, and similarly, S and S′ are independent samples from the predicted distribution P_out. The energy distance D_GED is a metric as long as d is also a metric [29]. 
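A sample-based estimate of Eq. 5 can be sketched as follows (a simplified illustration; the exact estimator used for each experiment is given in Appendix B). We already plug in the distance d(x, y) = 1 − IoU(x, y) that is chosen below:

```python
import numpy as np

def iou(a, b):
    """IoU of two binary masks; defined as 1 when both masks are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else inter / union

def d(a, b):
    """The metric d(x, y) = 1 - IoU(x, y)."""
    return 1.0 - iou(a, b)

def ged_squared(samples, ground_truths):
    """Naive estimate of D^2_GED (Eq. 5): 2 E[d(S,Y)] - E[d(S,S')] - E[d(Y,Y')]."""
    cross = np.mean([d(s, y) for s in samples for y in ground_truths])
    within_s = np.mean([d(s, s2) for s in samples for s2 in samples])
    within_y = np.mean([d(y, y2) for y in ground_truths for y2 in ground_truths])
    return 2 * cross - within_s - within_y
```

When the set of predicted samples matches the set of ground-truth segmentations, the cross term equals the within terms and the estimate is zero; missing or over-represented variants push it up.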
In our case we choose d(x, y) = 1 − IoU(x, y), which, as proved in [30, 31], is a metric. In practice, we only have access to samples from the distributions that models induce, so we rely on statistics of Eq. 5, D̂²_GED. The details about its computation for each experiment are presented in Appendix B.

3.2 Baseline methods

With the aim of providing context for the performance of our proposed approach we compare against a range of baselines. To the best of our knowledge there exists no other work that has considered capturing a distribution over multi-modal segmentations and has measured the agreement with such a distribution. For fair comparison, we train the baseline models whose architectures are depicted in Fig. 2 in the exact same manner as we train ours. The baseline methods all involve the same U-Net architecture, i.e. they share the same core component and thus employ comparable numbers of learnable parameters in the segmentation tasks.

Dropout U-Net (Fig. 2a). Our 'Dropout U-Net' baselines follow the Bayesian SegNet's [7] proposition: we drop out the activations of the respective incoming layers of the three inner-most encoder and decoder blocks with a dropout probability of p = 0.5 during training as well as when sampling.

3An open source re-implementation of our approach can be found at https://github.com/SimonKohl/probabilistic_unet.

Figure 2: Baseline architectures. Arrows: flow of operations; blue blocks: feature maps; red blocks: feature maps with dropout; green block: broadcasted latents. Note that the number of feature map blocks shown is reduced for clarity of presentation. (a) Dropout U-Net. (b) U-Net Ensemble. (c) M-Heads. (d) Image2Image VAE.

U-Net Ensemble (Fig. 2b). We report results for ensembles with the number of members matching the required number of samples (referred to as 'U-Net Ensemble'). 
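The Dropout U-Net's sampling behaviour, keeping dropout active at inference just as during training, can be sketched as follows (a toy stand-in with a single linear layer, not the actual U-Net or Bayesian SegNet architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5):
    """Inverted dropout; for this baseline it stays active when sampling, too."""
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def forward(x, w, sample=True):
    """Toy stand-in for a network block: linear layer + ReLU + dropout."""
    h = np.maximum(x @ w, 0.0)
    return dropout(h) if sample else h

x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 16))

# Each call is one Monte Carlo sample; pixel-wise uncertainty statistics come
# from aggregating many such stochastic forward passes.
mc_samples = np.stack([forward(x, w) for _ in range(8)])
pixelwise_mean = mc_samples.mean(axis=0)
```

Note that each such sample is a stochastic perturbation of one network's output, which is why the aggregate captures pixel-wise uncertainty but not coherent alternative hypotheses.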
The original deterministic variant of the U-Net is the 1-sample corner case of an ensemble.

M-Heads (Fig. 2c). Aiming for diverse semantic segmentation outputs, the works of [12] and [13] propose to branch off M heads after the last layer of a deep net, each of which contributes one output variant. An adjusted cross-entropy loss that adaptively assigns heads to ground-truth hypotheses is employed to promote diversity while reducing the risk of idle heads: the loss of the best performing head is weighted with a factor of 1 − ε, while the remaining heads each contribute with a weight of ε/(M − 1) to the loss. For our 'M-Heads' baselines we again employ a U-Net core and set ε = 0.05 as proposed by [12]. In order to allow for the evaluation of 4, 8 and 16 samples, we train M-Heads models with the corresponding number of heads.

Image2Image VAE (Fig. 2d). In [21] the authors propose a U-Net VAE-GAN hybrid for multi-modal image-to-image translation, that owes its stochasticity to normally distributed latents that are broadcasted and fed into the encoder path of the U-Net. In order to deal with the complex solution space in image-to-image translation tasks, they employ an adversarial discriminator as additional supervision alongside a reconstruction loss. In the fully supervised setting of semantic segmentation such an additional learning signal is however not necessary and we therefore train with a cross-entropy loss only. 
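The relaxed winner-take-all weighting used by the M-Heads baseline can be sketched as follows (the per-head loss values passed in are hypothetical cross-entropy numbers for illustration):

```python
import numpy as np

def m_heads_loss(head_losses, eps=0.05):
    """Relaxed winner-take-all loss over M heads.

    The best-performing head gets weight 1 - eps; each remaining head gets
    eps / (M - 1), which keeps some gradient flowing to non-winning heads
    and so reduces the risk of idle heads.
    """
    head_losses = np.asarray(head_losses, dtype=float)
    m = len(head_losses)
    weights = np.full(m, eps / (m - 1))
    weights[np.argmin(head_losses)] = 1.0 - eps
    return float(np.sum(weights * head_losses))
```

For example, with three heads and losses [0.2, 1.5, 0.9], the first head is the winner and the combined loss is 0.95 · 0.2 + 0.025 · 1.5 + 0.025 · 0.9 = 0.25.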
In contrast to our proposition, this baseline, which we refer to as the 'Image2Image VAE', employs a prior that is not conditioned on the input image (a fixed normal distribution) and a posterior net that is not conditioned on the input either.

In all cases we examine the models' performance when drawing a different number of samples (1, 4, 8 and 16) from each of them.

4 Results

A quantitative evaluation of multiple segmentation predictions per image requires annotations from multiple labelers. Here we consider two datasets: the LIDC-IDRI dataset [32, 33, 34], which contains 4 annotations per input, and the Cityscapes dataset [35], which we artificially modify by adding synonymous classes to introduce uncertainty in the way concepts are labelled.

4.1 Lung abnormalities segmentation

The LIDC-IDRI dataset [32, 33, 34] contains 1018 lung CT scans from 1010 lung patients with manual lesion segmentations from four experts. This dataset is a good representation of the typical ambiguities that appear in CT scans. For each scan, 4 radiologists (from a total of 12) provided annotation masks for lesions that they independently detected and considered to be abnormal. We use the masks resulting from a second reading in which the radiologists were shown the anonymized annotations of the others and were allowed to make adjustments to their own masks.

For our experiments we split this dataset into a training set composed of 722 patients, a validation set composed of 144 patients, and a test set composed of the remaining 144 patients. We then resampled the CT scans to 0.5 mm × 0.5 mm in-plane resolution (the original resolution is between 0.461 mm and 0.977 mm, 0.688 mm on average) and cropped 2D images (180 × 180 pixels) centered at the lesion positions. The lesion positions are those where at least one of the experts segmented a lesion. By cropping the scans, the resultant task is in isolation not directly clinically relevant. However, this allows us to ignore the vast areas in which all labelers agree, in order to focus on those where there is uncertainty. This resulted in 8882 images in the training set, 1996 images in the validation set and 1992 images in the test set. Because the experts can disagree whether the lesion is abnormal tissue, up to 3 masks per image can be empty. Fig. 3a shows an example of such lesion-centered images and the masks provided by 4 graders.

Figure 3: Qualitative results. The first row shows the input image and the ground truth segmentations. The following rows show results from the baselines and from our proposed method. (a) lung CT scan from the LIDC test set. Ground truth: 4 graders. (b) Cityscapes. Images cropped to squares for ease of presentation. Ground truth: 32 artificial modes. Best viewed in colour.

Figure 4: Comparison of approaches using the squared energy distance. Lower energy distances correspond to better agreement between predicted distributions and ground truth distribution of segmentations. The symbols that overlay the distributions of data points mark the mean performance. (a) Performance on lung abnormalities segmentation on our LIDC-IDRI test set. (b) Performance on the official Cityscapes validation set (our test set).

As all models share the same U-Net core component and for fairness and ease of comparability, we let all models undergo the same training schedule, which is detailed in subsection H.1.

In order to grasp some intuition about the kind of samples produced by each model, we show in Fig. 3a, as well as in Appendix F, representative results for the baseline methods and our proposed Probabilistic U-Net. Fig. 4a shows the squared generalized energy distance D̂²_GED for all models as a function of the number of samples. 
The data accumulations visible as horizontal stripes are owed\nto the existence of empty ground-truth masks. The energy distance on the 1992 images large lung\nabnormalities test set, decreases for all models as more samples are drawn indicating an improved\nmatching of the ground-truth distribution as well as enhanced sample diversity. Our proposed\n\n6\n\n\fProbabilistic U-Net outperforms all baselines when sampling 4, 8 and 16 times. The performance at\n16 samples is found signi\ufb01cantly higher than that of the baselines (p-value \u223c O(10\u221213)), according\nto the Wilcoxon signed-rank test. Finally, in Appendix E we show the results of an experiment\nregarding the capacity different models have to distinguish between unambiguous and ambiguous\ninstances (i.e. instances where graders disagree on the presence of a lesion).\n\n4.2 Cityscapes semantic segmentation\n\nAs a second dataset we use the Cityscapes dataset [35]. It contains images of street scenes taken\nfrom a car with corresponding semantic segmentation maps. A total of 19 different semantic classes\nare labelled. Based on this dataset we designed a task that allows full control of the ambiguities:\nwe create ambiguities by arti\ufb01cial random \ufb02ips of \ufb01ve classes to newly introduced classes. We \ufb02ip\n\u2018sidewalk\u2019 to \u2018sidewalk 2\u2019 with a probability of 8/17, \u2018person\u2019 to \u2018person 2\u2019 with a probability of\n7/17, \u2018car\u2019 to \u2018car 2\u2019 with 6/17, \u2018vegetation\u2019 to \u2018vegetation 2\u2019 with 5/17 and \u2018road\u2019 to \u2018road 2\u2019 with\nprobability 4/17. This choice yields distinct probabilities for the ensuing 25 = 32 discrete modes\nwith probabilities ranging from 10.9% (all un\ufb02ipped) down to 0.5% (all \ufb02ipped). The of\ufb01cial training\ndataset with \ufb01ne-grained annotation labels comprises 2975 images and the validation dataset contains\n500 images. 
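The quoted mode probabilities (10.9% down to 0.5%) follow directly from the per-class flip probabilities, since each of the 2^5 = 32 modes is a combination of independent flip decisions; a quick check in Python:

```python
from itertools import product

# Flip probabilities for the five classes, as set in the Cityscapes experiment.
flip_probs = {"sidewalk": 8/17, "person": 7/17, "car": 6/17,
              "vegetation": 5/17, "road": 4/17}

# Each mode is a tuple of per-class flip decisions; its probability is the
# product of the corresponding flip / no-flip probabilities.
mode_probs = {}
for flips in product([False, True], repeat=5):
    p = 1.0
    for (cls, q), flipped in zip(flip_probs.items(), flips):
        p *= q if flipped else 1.0 - q
    mode_probs[flips] = p
```

The all-unflipped mode has probability (9 · 10 · 11 · 12 · 13)/17^5 ≈ 10.9% and the all-flipped mode (8 · 7 · 6 · 5 · 4)/17^5 ≈ 0.5%, matching the figures above.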
We employ this official validation set as a test set to report results on, and split off 274 images (corresponding to the 3 cities of Darmstadt, Mönchengladbach and Ulm) from the official training set as our internal validation set. As in the previous experiment, in this task we use a similar setting for the training processes of all approaches, which we present in detail in subsection H.2.

Fig. 3b shows samples of each approach in the comparison given one input image. In Appendix G we show further samples of other images, produced by our approach. Fig. 4b shows that the Probabilistic U-Net on the Cityscapes task outperforms the baseline methods when sampling 4, 8 and 16 times in terms of the energy distance. This edge in segmentation performance at 16 samples is highly significant according to the Wilcoxon signed-rank test (p-value ∼ O(10^−77)). We have also conducted ablation experiments in order to explore which elements of our architecture contribute to its performance. These were (1) fixing the prior, (2) fixing the prior and not using the context in the posterior, and (3) injecting the latent features at the beginning of the U-Net. Each of these variations resulted in a lower performance. Detailed results can be found in Appendix D.

Reproducing the segmentation probabilities. In the Cityscapes segmentation task, we can provide further analysis by leveraging our knowledge of the underlying conditional distribution that we have set by design. In particular we compare the frequency with which every model predicts each mode, to the corresponding ground truth probability of that mode. To compute the frequency of each mode by each model, we draw 16 samples from that model for all images in the test set. Then we count the number of those samples that have that mode as the closest (using 1 − IoU as the distance function).

In Fig. 5 (and Figs. 
8, 9, 10 in Appendix C) we report the mode-wise frequencies for all 32 modes in the Cityscapes task and show that the Probabilistic U-Net is the only model in this comparison that is able to closely capture the frequencies of a large combinatorial space of hypotheses including very rare modes, thus supplying calibrated likelihoods of modes.

Figure 5: Reproduction of the probabilities of the segmentation modes on the Cityscapes task. The artificial flipping of 5 classes results in 32 modes with different ground truth probability (x-axis). The y-axis shows the frequency of how often the model predicted this variant in the whole test set. Agreement with the bisector line indicates calibration quality.

The Image2Image VAE is the only model among competitors that picks up on all variants, but the frequencies are far off, as can be seen in its deviation from the bisector line in blue. The other baselines perform worse still in that all of them fail to represent modes, and the modes they do capture do not match the expected frequencies.

4.3 Analysis of the Latent Space

The embedding of the segmentation variants in a low-dimensional latent space allows a qualitative analysis of the internal representation of our model. For a 2D or 3D latent space we can directly visualize where the segmentation variants get assigned. 
See Appendix A for details.\n\n5 Discussion and conclusions\n\nOur \ufb01rst set of experiments demonstrates that our proposed architecture provides consistent segmen-\ntation maps that closely match the multi-modal ground-truth distributions given by the expert graders\nin the lung abnormalities task and by the combinatorial ground-truth segmentation modes in the\nCityscapes task. The employed IoU-based energy distance measures whether the models\u2019 individual\nsamples are both coherent as well as whether they are produced with the expected frequencies. It\nnot only penalizes predicted segmentation variants that are far away from the ground truth, but also\npenalizes missing variants. On this task the Probabilistic U-Net is able to signi\ufb01cantly outperform the\nconsidered baselines, indicating its capability to model the joint likelihood of segmentation variants.\nThe second type of experiments demonstrates that our model scales to complex output distributions\nincluding the occurrence of very rare modes. With 32 discrete modes of largely differing occurrence\nlikelihoods (0.5% to 10.9%), the Cityscapes task requires the ability to closely match complex data\ndistributions. Here too our model performs best and picks the segmentation modes very close to the\nexpected frequencies, all the way into the regime of very unlikely modes, thus defying mode-collapse\nand exhibiting excellent probability calibration. As an additional advantage our model scales to\nsuch large numbers of modes without requiring any prior assumptions on the number of modes or\nhypotheses.\nThe lower performance of the baseline models relative to our proposition can be attributed to\ndesign choices of these models. While the Dropout U-Net successfully models the pixel-wise data\ndistribution (Fig. 8a bottom right, in the Appendix), such pixel-wise mixtures of variants can not be\nvalid hypotheses in themselves (see Fig. 3). 
The U-Net Ensemble's members are trained independently and each of them can only learn the most likely segmentation variant, as attested to by Fig. 8b. In contrast to that, the closely related M-Heads model can pick up on multiple discrete segmentation modes, due to the joint training procedure that enables diversity. The training, however, does not allow frequencies to be represented correctly and requires knowledge of the number of present variants (see Fig. 9a in the Appendix). Furthermore, neither the U-Net Ensemble nor the M-Heads can deal with the combinatorial explosion of segmentation variants when multiple aspects vary independently of each other. The Image2Image VAE shares similarities with our model, but as its prior is fixed and not conditioned on the input image, it cannot learn to capture variant frequencies by allocating corresponding probability mass to the respective latent space regions. Fig. 17 in the Appendix shows a severe miscalibration of variant likelihoods on the lung abnormalities task that is also reflected in its corresponding energy distance. Furthermore, in this architecture, the latent samples are fed into the U-Net's encoder path, while we feed in the samples just after the decoder path. This design choice in the Image2Image VAE requires the model to carry the latent information all the way through the U-Net core, while simultaneously performing the recognition required for segmentation, which might additionally complicate training (see analysis in Appendix D). 
Besides that, our design choice of late injection has the additional advantage that we can produce a large set of samples for a given image at a very low computational cost: for each new sample from the latent space, only the network part after the injection needs to be re-executed to produce the corresponding segmentation map (this bears similarity to the approach taken in [23], where a generative model is employed for hand pose estimation).
Aside from the ability to capture arbitrary modes with their corresponding probability conditioned on the input, our proposed Probabilistic U-Net allows its latent space to be inspected. This is because, as opposed to e.g. GAN-based approaches, VAE-like models explicitly parametrize distributions, a characteristic that grants direct access to the corresponding likelihood landscape. Appendix A discusses how the Probabilistic U-Net chooses to structure its latent spaces.
Compared to the aforementioned concurrent work on image-to-image tasks [22], our model disentangles the prior and the segmentation net. This can be of particular relevance in medical imaging, where processing 3D scans is common. In this case it is desirable to condition on the entire scan, while retaining the possibility to process the scan tile by tile in order to handle large volumes with large models and a limited amount of GPU memory.
On a more general note, we would like to remark that current image-to-image translation tasks only allow subjective (and expensive) performance evaluations, as it is typically intractable to assess the entire solution space. For this reason surrogate metrics, such as the inception score based on evaluation via a separately trained deep net, are employed [36].
The task of multi-modal semantic segmentation, which we consider here, allows for a direct and thus perhaps more meaningful manner of performance evaluation, and could help guide the design of future generative architectures.
All in all, we see a large field where our proposed Probabilistic U-Net can replace the currently applied deterministic U-Nets. Especially in the medical domain, with its often ambiguous images and highly critical decisions that depend on the correct interpretation of the image, our model's segmentation hypotheses and their likelihoods could 1) inform diagnosis/classification probabilities or 2) guide steps to resolve ambiguities. Our method could prove useful beyond explicitly multi-modal tasks, as the inspectability of the Probabilistic U-Net's latent space could yield insights for many segmentation tasks that are currently treated as uni-modal problems.

6 Acknowledgements

The authors would like to thank Mustafa Suleyman, Trevor Back and the whole DeepMind team for their exceptional support, and Shakir Mohamed and Andrew Zisserman for very helpful comments and discussions. The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free, publicly available LIDC/IDRI Database used in this study.

References

[1] Lee, S., Prakash, S.P.S., Cogswell, M., Ranjan, V., Crandall, D., Batra, D.: Stochastic multiple choice learning for training diverse deep ensembles. In: Advances in Neural Information Processing Systems. (2016) 2119–2127

[2] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR). (2013)

[3] Jimenez Rezende, D., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models.
In: Proceedings of the 31st International Conference on Machine Learning (ICML). (2014)

[4] Kingma, D.P., Jimenez Rezende, D., Mohamed, S., Welling, M.: Semi-supervised learning with deep generative models. In: Neural Information Processing Systems (NIPS). (2014)

[5] Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems. (2015) 3483–3491

[6] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015. Volume 9351 of LNCS., Springer (2015) 234–241

[7] Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680 (2015)

[8] Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems. (2017) 5580–5590

[9] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems. (2017) 6405–6416

[10] Guzman-Rivera, A., Batra, D., Kohli, P.: Multiple choice learning: Learning to produce multiple structured outputs. In: Advances in Neural Information Processing Systems. (2012) 1799–1807

[11] Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D.: Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314 (2015)

[12] Rupprecht, C., Laina, I., DiPietro, R., Baust, M., Tombari, F., Navab, N., Hager, G.D.: Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In: International Conference on Computer Vision (ICCV).
(2017)

[13] Ilg, E., Çiçek, Ö., Galesso, S., Klein, A., Makansi, O., Hutter, F., Brox, T.: Uncertainty estimates for optical flow with multi-hypotheses networks. arXiv preprint arXiv:1802.07095 (2018)

[14] Chen, C., Kolmogorov, V., Zhu, Y., Metaxas, D., Lampert, C.: Computing the M most probable modes of a graphical model. In: Artificial Intelligence and Statistics. (2013) 161–169

[15] Batra, D., Yadollahpour, P., Guzman-Rivera, A., Shakhnarovich, G.: Diverse M-best solutions in Markov random fields. In: European Conference on Computer Vision, Springer (2012) 1–16

[16] Kirillov, A., Savchynskyy, B., Schlesinger, D., Vetrov, D., Rother, C.: Inferring M-best diverse labelings in a single one. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1814–1822

[17] Kirillov, A., Shlezinger, D., Vetrov, D.P., Rother, C., Savchynskyy, B.: M-best-diverse labelings for submodular energies and beyond. In: Advances in Neural Information Processing Systems. (2015) 613–621

[18] Kirillov, A., Shekhovtsov, A., Rother, C., Savchynskyy, B.: Joint M-best-diverse labelings as a parametric submodular minimization. In: Advances in Neural Information Processing Systems. (2016) 334–342

[19] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)

[20] Goodfellow, I.: NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)

[21] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems.
(2017) 465–476

[22] Esser, P., Sutter, E., Ommer, B.: A variational U-net for conditional appearance and shape generation. arXiv preprint arXiv:1804.04694 (2018)

[23] Bouchacourt, D., Mudigonda, P.K., Nowozin, S.: DISCO nets: Dissimilarity coefficients networks. In: Advances in Neural Information Processing Systems. (2016) 352–360

[24] Rao, C.R.: Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology 21(1) (1982) 24–43

[25] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations. (2017)

[26] Bellemare, M.G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., Munos, R.: The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017)

[27] Salimans, T., Zhang, H., Radford, A., Metaxas, D.: Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573 (2018)

[28] Székely, G.J., Rizzo, M.L.: Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference 143(8) (2013) 1249–1272

[29] Klebanov, L.B., Beneš, V., Saxl, I.: N-distances and their applications. Charles University in Prague, the Karolinum Press (2005)

[30] Kosub, S.: A note on the triangle inequality for the Jaccard distance. arXiv preprint arXiv:1612.02696 (2016)

[31] Lipkus, A.H.: A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry 26(1-3) (1999) 263–265

[32] Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Clarke, L.P.: Data from LIDC-IDRI. The Cancer Imaging Archive.
http://doi.org/10.7937/K9/TCIA.2015.LO9QL9SX (2015)

[33] Armato, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics 38(2) (2011) 915–931

[34] Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., et al.: The cancer imaging archive (TCIA): maintaining and operating a public information repository. Journal of Digital Imaging 26(6) (2013) 1045–1057

[35] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 3213–3223

[36] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems. (2016) 2234–2242

[37] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)