{"title": "Region Mutual Information Loss for Semantic Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 11117, "page_last": 11127, "abstract": "Semantic segmentation is a fundamental problem in computer vision.\n\tIt is considered as a pixel-wise classification problem in practice,\n\tand most segmentation models use a pixel-wise loss as their optimization criterion.\n\tHowever, the pixel-wise loss ignores the dependencies between pixels in an image.\n\tSeveral ways to exploit the relationship between pixels have been investigated,\n\t\\eg, conditional random fields (CRF) and pixel affinity based methods.\n\tNevertheless, these methods usually require additional model\n\tbranches, large extra memories, or more inference time.\n\tIn this paper, we develop a region mutual information (RMI) loss\n\tto model the dependencies among pixels more simply and efficiently.\n\tIn contrast to the pixel-wise loss which treats the pixels as independent samples,\n\tRMI uses one pixel and its neighbour pixels to represent this pixel.\n\tThen for each pixel in an image,\n\twe get a multi-dimensional point that encodes the relationship between pixels,\n\tand the image is cast into a multi-dimensional distribution of\n\tthese high-dimensional points.\n\tThe prediction and ground truth thus can achieve high order consistency\n\tthrough maximizing the mutual information (MI) between their multi-dimensional distributions.\n\tMoreover, as the actual value of the MI is hard to calculate,\n\twe derive a lower bound of the MI and maximize the lower bound to maximize the\n\treal value of the MI.\n\tRMI only requires a few extra computational resources in the training stage,\n\tand there is no overhead during testing.\n\tExperimental results demonstrate\n\tthat RMI can achieve substantial and consistent improvements in performance on PASCAL VOC 2012 and CamVid datasets.\n\tThe code is available at \\url{https://github.com/ZJULearning/RMI}.", 
"full_text": "Region Mutual Information Loss for Semantic Segmentation

Shuai Zhao1, Yang Wang2, Zheng Yang3, Deng Cai1,4*

1State Key Lab of CAD&CG, College of Computer Science, Zhejiang University
2School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
3Fabu Inc., Hangzhou, China
4Alibaba-Zhejiang University Joint Institute of Frontier Technologies
zhaoshuaimcc@gmail.com, wangyang_sky@hust.edu.cn, yangzheng@fabu.ai, dcai@zju.edu.cn

Abstract

Semantic segmentation is a fundamental problem in computer vision. It is considered as a pixel-wise classification problem in practice, and most segmentation models use a pixel-wise loss as their optimization criterion. However, the pixel-wise loss ignores the dependencies between pixels in an image. Several ways to exploit the relationship between pixels have been investigated, e.g., conditional random fields (CRF) and pixel affinity based methods. Nevertheless, these methods usually require additional model branches, large extra memories, or more inference time. In this paper, we develop a region mutual information (RMI) loss to model the dependencies among pixels more simply and efficiently. In contrast to the pixel-wise loss, which treats the pixels as independent samples, RMI uses one pixel and its neighbour pixels to represent this pixel. Then for each pixel in an image, we get a multi-dimensional point that encodes the relationship between pixels, and the image is cast into a multi-dimensional distribution of these high-dimensional points. The prediction and ground truth can thus achieve high order consistency through maximizing the mutual information (MI) between their multi-dimensional distributions. Moreover, as the actual value of the MI is hard to calculate, we derive a lower bound of the MI and maximize the lower bound to maximize the real value of the MI. 
RMI only requires a few extra computational resources in the training stage, and there is no overhead during testing. Experimental results demonstrate that RMI can achieve substantial and consistent improvements in performance on the PASCAL VOC 2012 and CamVid datasets. The code is available at https://github.com/ZJULearning/RMI.

1 Introduction

Semantic segmentation is a fundamental problem in computer vision; its goal is to assign semantic labels to every pixel in an image. Recently, much progress has been made with powerful convolutional neural networks (e.g., VGGNet [33], ResNet [14], Xception [8]) and sophisticated segmentation models (e.g., FCN [23], PSPNet [40], SDN [11], DeepLab [5, 6, 7], ExFuse [39], EncNet [38]). These segmentation approaches treat semantic segmentation as a pixel-wise classification problem and solve it by minimizing the average pixel-wise classification loss over the image. The most commonly used pixel-wise loss for semantic segmentation is the softmax cross entropy loss:

L_{ce}(y, p) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log(p_{n,c}),   (1)

*Deng Cai is the corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An image region and its corresponding multi-dimensional point. Following the same strategy, an image can be cast into a multi-dimensional distribution of many high-dimensional points, which encode the relationship between pixels.

where y \in \{0, 1\} is the ground truth label, p \in [0, 1] is the estimated probability, N denotes the number of pixels, and C represents the number of object classes. Minimizing the cross entropy between y and p is equivalent to minimizing their relative entropy, i.e., their Kullback-Leibler (KL) divergence [12].

As Eq. (1) shows, the softmax cross entropy loss is calculated pixel by pixel. It ignores the relationship between pixels. 
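As a concrete illustration of Eq. (1), here is a minimal NumPy sketch of the pixel-wise softmax cross entropy (our own illustrative code, not the authors' implementation; names are ours):

```python
import numpy as np

def pixelwise_cross_entropy(y, p, eps=1e-12):
    """Eq. (1): average over N pixels of -sum_c y_{n,c} * log(p_{n,c}).

    y: one-hot ground truth, shape [N, C]; p: predicted probabilities, shape [N, C].
    eps guards against log(0).
    """
    return float(-np.mean(np.sum(y * np.log(p + eps), axis=1)))

# Two pixels, two classes: each pixel contributes -log of the probability
# assigned to its true class; the pixels are averaged independently.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[0.8, 0.2], [0.3, 0.7]])
loss = pixelwise_cross_entropy(y, p)  # (-log 0.8 - log 0.7) / 2 ≈ 0.2899
```

Note that each pixel enters the sum as an independent sample, which is exactly the limitation discussed next.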
However, strong dependencies exist among the pixels in an image, and these dependencies carry important information about the structure of the objects [37]. Consequently, models trained with a pixel-wise loss may struggle to identify a pixel when its visual evidence is weak or when it belongs to an object with a small spatial structure [18], and the performance of the model may be limited.

Several ways to model the relationship between pixels have been investigated, e.g., conditional random field (CRF) based methods [5, 19, 32, 22, 41] and pixel affinity based methods [18, 21, 25]. Nevertheless, CRF usually has time-consuming iterative inference routines and is sensitive to visual appearance changes [18], while pixel affinity based methods require extra model branches to extract the pixel affinity from images or additional memory to hold the large pixel affinity matrix. Due to these factors, the top-performing models [11, 40, 6, 7, 39, 38] do not adopt these methods.

In this paper, we develop a region mutual information (RMI) loss for semantic segmentation to model the relationship between pixels more simply and efficiently. Our work is inspired by the region mutual information for medical image registration [30]. The idea of RMI is intuitive. As shown in Fig. 1, given a pixel, if we use this pixel and its 8 neighbours to represent it, we get a 9-dimensional (9-D) point. For an image, we can get many 9-D points, and the image is cast into a multi-dimensional (multivariate) distribution of these 9-D points. Each 9-D point also represents a small 3×3 region, and the relationship between pixels is encoded in these 9-D points.

After we get the two multi-dimensional distributions of the ground truth and of the prediction given by the segmentation model, our purpose is to maximize their similarity. The mutual information (MI) is a natural information-theoretic measure of the independence of random variables [17]. 
It is also widely used as a similarity measure in the field of medical image registration [36, 24, 28, 30, 35]. Thus the prediction and ground truth can achieve higher order consistency through maximizing the MI between their multi-dimensional distributions than by using a pixel-wise loss alone. However, the pixels in an image are dependent, which makes the multi-dimensional distribution of the image hard to analyze. This means that calculating the actual value of the MI between two such undetermined distributions becomes infeasible. We therefore derive a lower bound of the MI and maximize this lower bound to maximize the real value of the MI between the two distributions.

We adopt a downsampling strategy before constructing the multi-dimensional distributions of the prediction and ground truth. The goal is to reduce memory consumption, so RMI only requires a few additional computational resources during training. In this way, it can be effortlessly incorporated into any existing segmentation framework without any changes to the base model. 
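The construction in Fig. 1 can be sketched as follows. This is our own illustrative NumPy code (not the authors' released implementation): every R × R neighbourhood of a map is flattened, so each pixel becomes a d = R² dimensional point.

```python
import numpy as np

def to_points(x, R=3):
    """Cast a 2-D map x of shape (H, W) into an array of d-dimensional points,
    d = R * R, one point per valid R x R region (no padding for simplicity)."""
    H, W = x.shape
    pts = [
        x[i:i + R, j:j + R].reshape(-1)   # flatten the R x R region
        for i in range(H - R + 1)
        for j in range(W - R + 1)
    ]
    return np.stack(pts)                   # shape [(H-R+1)*(W-R+1), R*R]

x = np.arange(16.0).reshape(4, 4)
pts = to_points(x, R=3)                    # 4 points, each 9-dimensional
```

Applying the same operation to the prediction and the ground truth yields the two collections of points whose multi-dimensional distributions are compared; on batched tensors, `torch.nn.functional.unfold` performs the equivalent extraction.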
RMI also has no extra inference steps during testing.

Experimental results demonstrate that the RMI loss can achieve substantial and consistent improvements in performance on the PASCAL VOC 2012 [10] and CamVid [4] datasets. We also empirically compare our method with some existing pixel-relationship modeling techniques [19, 20] on these two datasets, where RMI outperforms the others.

2 Entropy and mutual information

Let X be a discrete random variable with alphabet \mathcal{X} and probability mass function (PMF) p(x), x \in \mathcal{X}. For convenience, p(x) and p(y) refer to two different random variables; they are actually two different PMFs, p_X(x) and p_Y(y) respectively. The entropy of X is defined as [9]:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).   (2)

The entropy is a measure of the uncertainty of a random variable [9]. Then the joint entropy H(X, Y) and conditional entropy H(Y|X) of a pair of discrete random variables (X, Y) with joint distribution p(x, y) and conditional distribution p(y|x) are defined as:

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y),   (3)

H(Y|X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x).   (4)

Indeed, the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other [9]:

H(X, Y) = H(X) + H(Y|X).   (5)

We now introduce mutual information, which is a measure of the amount of information that X and Y contain about each other [9]. It is defined as:

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.   (6)

Equation (6) suggests that I(X; Y) is a very natural measure of dependence [17]. It can also be considered as the reduction in the uncertainty of X due to the knowledge of Y, and vice versa:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).   (7)

Similar definitions for continuous random variables can be found in [9, Chap. 8].

3 Methodology

As shown in Fig. 1, we can get two multivariate random variables, P = [p_1, p_2, ..., p_d]^T and Y = [y_1, y_2, ..., y_d]^T, where P \in R^d is the predicted probability, Y \in R^d denotes the ground truth, p_i \in [0, 1], and y_i is 0 or 1. If the square region size in Fig. 1 is R × R, then d = R × R. The probability density functions (PDFs) of P and Y are f(p) and f(y) respectively, and their joint PDF is f(y, p). The distribution of P can also be considered as the joint distribution of p_1, p_2, ..., p_d, which means f(p) = f(p_1, p_2, ..., p_d). The mutual information I(Y; P) is defined as [9]:

I(Y; P) = \int_{\mathcal{Y}} \int_{\mathcal{P}} f(y, p) \log \frac{f(y, p)}{f(y) f(p)} \, dy \, dp,   (8)

where \mathcal{Y} and \mathcal{P} are the support sets of Y and P respectively. 
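For intuition, Eqs. (2)-(7) can be checked numerically on a small discrete joint distribution. The sketch below (our own, not from the paper) computes I(X; Y) from Eq. (6) and cross-checks it against the identity I(X; Y) = H(X) + H(Y) - H(X, Y), which follows from Eqs. (5) and (7):

```python
import numpy as np

def entropy(p):
    """Eq. (2): H = -sum p log p, skipping zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(pxy):
    """Eq. (6): I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log((pxy / (px * py))[mask])))

# A correlated joint PMF over two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
i_xy = mutual_information(pxy)
# Identity from Eqs. (5) and (7): I(X;Y) = H(X) + H(Y) - H(X,Y)
h_check = entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy.ravel())
```

For an independent pair (p(x, y) = p(x)p(y)) the same function returns zero, matching the interpretation of MI as a dependence measure.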
Our purpose is to maximize I(Y; P) to achieve high order consistency between Y and P.

To get the mutual information, one straightforward way is to find the above PDFs. However, the random variables p_1, p_2, ..., p_d are dependent, because the pixels in an image are dependent. This makes their joint density function f(p) hard to analyze. In [30], Russakoff et al. demonstrated that, for grayscale images, Y and P are normally distributed when R is large enough, which is well supported by the m-dependence concept proposed by Hoeffding et al. [16]. Nevertheless, in our situation, we find that for Y and P to be approximately normally distributed, the side length R must be very large, e.g., R \geq 30. The dimension d is then larger than 900, and the memory consumption becomes extremely large, so implementing this method is unrealistic. Due to these factors, we derive a lower bound of I(Y; P) and maximize the lower bound to maximize the actual value of I(Y; P).

3.1 A lower bound of mutual information

From Eq. (7), we have I(Y; P) = H(Y) - H(Y|P). At the same time, we know that a normal distribution maximizes the entropy over all distributions with the same covariance [9, Theorem 8.6.5], and the entropy of a normal distribution with covariance matrix \Sigma \in R^{d \times d} is \frac{1}{2} \log\big((2\pi e)^d \det(\Sigma)\big), where \det(\cdot) is the determinant of the matrix. We can thus get a lower bound of the mutual information I(Y; P):

I(Y; P) = H(Y) - H(Y|P)   (9)
        \geq H(Y) - \frac{1}{2} \log\big((2\pi e)^d \det(\Sigma_{Y|P})\big),   (10)

where \Sigma_{Y|P} is the posterior covariance matrix of Y given P. It is a symmetric positive semidefinite matrix. This lower bound is also discussed in [29]. Following the commonly used cross entropy loss (Eq. (1)), we ignore the constant terms which are not related to the parameters of the model. 
Then we get a simplified lower bound to maximize:

I_l(Y; P) = -\frac{1}{2} \log\big(\det(\Sigma_{Y|P})\big).

3.2 An approximation of the posterior variance

At this point, the key problem becomes finding the posterior covariance matrix \Sigma_{Y|P}. However, we cannot get the exact \Sigma_{Y|P} because we do not know the PDFs of Y and P or their dependence. Fortunately, Triantafyllopoulos et al. [34] have already given an approximation of the posterior variance under a certain assumption in Bayesian inference.

Suppose we need to estimate Y given P. E(Y) is the mean vector of Y (also \mu_y), Var(Y) is the variance matrix of Y (also \Sigma_Y), and Cov(Y, P) is the covariance matrix of Y and P. Triantafyllopoulos et al. [34] use the notation Y \perp_2 P to indicate that Y and P are second-order independent, i.e., E(Y|P = p) = E(Y) and Var(Y|P = p) = Var(Y) for any value p of P. Second-order independence is a weaker constraint than strict mutual independence. Furthermore, the regression matrix A_{yp} of Y on P is introduced, well known as A_{yp} = Cov(Y, P) \Sigma_P^{-1}. It is easy to verify that Y - A_{yp}P and P are uncorrelated by calculating their linear correlation coefficient. To obtain the approximation of the posterior covariance matrix Var(Y|P = p), Triantafyllopoulos et al. [34] assume that

(Y - A_{yp}P) \perp_2 P.   (11)

This assumption means that Var(Y - A_{yp}P | P = p) does not depend on the value p of P. Following the properties of the covariance matrix and the definition of second-order independence, we can get:

Var(Y|P = p) = Var(Y - A_{yp}P | P = p) = Var(Y - A_{yp}P) = \Sigma_Y - Cov(Y, P)(\Sigma_P^{-1})^T Cov(Y, P)^T.   (12)

Theorem 3.1. Consider random vectors Y and P as above. Under quadratic loss, \mu_y + A_{yp}(P - \mu_p) is the Bayes linear estimator if and only if (Y - A_{yp}P) \perp_2 P.

The assumption in Eq. (11) is supported by Theorem 3.1 [34, Theorem 1]. The theorem suggests that if one accepts the assumptions of Bayes linear optimality, assumption (11) must be employed. Under assumption (11), we get the linear minimum mean squared error (MMSE) estimator. This also demonstrates that the difference between the approximation of the posterior variance (Eq. (12)) and the real posterior variance is restricted to a certain range; otherwise, E(Y|P) = \mu_y + A_{yp}(P - \mu_p) could not be the linear MMSE estimator. A theoretical proof of Theorem 3.1 and some examples are given in [34]. Now we can get an approximation of Eq. (10):

I_l(Y; P) \approx -\frac{1}{2} \log\Big(\det\big(\Sigma_Y - Cov(Y, P)(\Sigma_P^{-1})^T Cov(Y, P)^T\big)\Big).   (13)

For brevity, we set M = \Sigma_Y - Cov(Y, P)(\Sigma_P^{-1})^T Cov(Y, P)^T, where M \in R^{d \times d}; it is a positive semidefinite matrix because it is a covariance matrix of (Y - A_{yp}P).

4 Implementation details

In this section, we discuss some of the devilish details of implementing RMI in practice.

Downsampling. As shown in Fig. 1, we choose pixels in a square region of size R × R to construct a multi-dimensional distribution. If R = 3, this leads to 9 times the memory consumption. For a float tensor with shape [16, 513, 513, 21], the original memory usage is about 0.33 GB, and this usage becomes about 9 × 0.33 = 2.97 GB with RMI. This also means more floating-point operations. We cannot afford such a large computational resource cost, so we downsample the ground truth and predicted probability to save resources with little sacrifice of performance.

Normalization. From Eq. (13), we get \log\big(\det(M)\big) = \sum_{i=1}^{d} \log \lambda_i, where the \lambda_i are the eigenvalues of M. It is easy to see that the magnitude of I_l(Y; P) is very likely related to the number of eigenvalues of M. 
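A quick numerical check of this scaling (our own sketch, with made-up matrices): for M = \sigma^2 I_d, \log \det(M) = \sum_i \log \lambda_i = d \log \sigma^2, which grows linearly in the dimension d:

```python
import numpy as np

sigma2 = 0.1
for d in (4, 16, 64):
    M = sigma2 * np.eye(d)               # covariance with d equal eigenvalues
    sign, logdet = np.linalg.slogdet(M)  # numerically stable log|det(M)|
    # log det grows linearly with d, hence the division by d below
    assert np.isclose(logdet, d * np.log(sigma2))
```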
To normalize the value of I_l(Y; P), we divide it by d:

I_l(Y; P) \approx -\frac{1}{2d} \log\big(\det(M)\big).   (14)

Underflow issue. The magnitudes of the probabilities given by the softmax or sigmoid operations may be very small. Meanwhile, the number of points may be very large, e.g., there are about 263 000 points in a label map of size 513 × 513. Therefore, when we use the formula Cov(Y, Y) = E\big((Y - \mu_y)(Y - \mu_y)^T\big) to calculate the covariance matrix, some entries of the matrix will have extremely small values, and we may encounter underflow issues when calculating the determinant of the covariance matrix in Eq. (14). So we rewrite Eq. (14) as [27, Page 59]:

I_l(Y; P) \approx -\frac{1}{2d} Tr\big(\log(M)\big),   (15)

where Tr(\cdot) is the trace of a matrix. Furthermore, M is a symmetric positive semidefinite matrix. In practice, we add a small positive constant to the diagonal entries of M, i.e., we use M = M + \xi I with \xi = 1e-6. This has little effect on the optima of the system, but it lets us accelerate the computation of Eq. (15) by performing a Cholesky decomposition of M, as M is now a symmetric positive definite matrix. This is already supported by PyTorch [26] and TensorFlow [1]. Moreover, double-precision floating-point numbers are used to ensure computational accuracy when calculating the value of RMI. 
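A minimal NumPy sketch of this stabilized computation (our own illustration, not the authors' released code): add \xi I to M, take a Cholesky factorization M = LL^T, and use \log \det(M) = 2 \sum_i \log L_{ii}, which is the practical form of Tr(\log(M)) in Eq. (15) and avoids forming \det(M) directly:

```python
import numpy as np

def rmi_lower_bound(M, xi=1e-6):
    """-1/(2d) * log det(M + xi*I) via Cholesky, cf. Eqs. (14)-(15).

    Forming det(M) directly underflows when M has tiny entries;
    summing the logs of the Cholesky diagonal does not.
    """
    d = M.shape[0]
    L = np.linalg.cholesky(M + xi * np.eye(d))   # M + xi*I is positive definite
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log det(M) = 2 * sum(log L_ii)
    return -log_det / (2.0 * d)

# A covariance matrix with very small entries: the naive determinant
# underflows to 0.0, while the Cholesky route stays finite.
d = 100
M = 1e-8 * np.eye(d)
naive = np.linalg.det(M + 1e-6 * np.eye(d))   # ~ (1.01e-6)^100, underflows to 0.0
stable = rmi_lower_bound(M)                   # finite
```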
It is also worth noting that \log\big(\det(M)\big) is concave when M is positive definite [3], which makes RMI easy to optimize.

Overall objective function. The overall objective function used for training the model is:

L_{all}(y, p) = \lambda L_{ce}(y, p) + (1 - \lambda) \frac{1}{B} \sum_{b=1}^{B} \sum_{c=1}^{C} \big(-I_l^{b,c}(Y; P)\big),   (16)

where \lambda \in [0, 1] is a weight factor, L_{ce}(y, p) is the normal cross entropy loss between y and p, B denotes the number of images in a mini-batch, and the maximization of RMI is cast as a minimization problem. The normal cross entropy loss acts as a measure of the similarity between the pixel intensities of two images, while RMI can be considered a measure of the structural similarity between two images. Following the structural similarity (SSIM) index [37], pixel similarity and structural similarity are considered equally important, so we simply set \lambda = 0.5.

We adopt the sigmoid operation rather than the softmax operation to get the predicted probabilities. This is because RMI is calculated channel-wise, and we do not want to introduce interference between channels. Experimental results demonstrate that the performance of models trained with softmax and sigmoid cross entropy losses is roughly the same.

5 Experiments

5.1 Experimental setup

Base model. We choose DeepLabv3 [6] and DeepLabv3+ [7] as our base models. The DeepLabv3+ model adds a decoder module to the DeepLabv3 model to refine the segmentation results. The backbone network is ResNet-101 [14], and the first 7×7 convolutional layer is replaced with three 3×3 convolutional layers [6, 7, 15]. The backbone network is only pretrained on ImageNet [31]^2.

Datasets. We evaluate our method on two datasets, PASCAL VOC 2012 [10] and CamVid [4]. The PASCAL VOC 2012 dataset contains 1 464 (train), 1 449 (val), and 1 456 (test) images. 
It contains 20 foreground object classes and one background class. The CamVid dataset is a street scene dataset, which contains 367 training, 101 validation, and 233 test images. We use the resized version of the CamVid dataset provided by SegNet [2]. It contains 11 object classes and an unlabelled class, and the image size is 480×360.

Learning rate and training steps. The warm-up learning rate strategy introduced in [15] and the poly learning rate policy are adopted. If the initial learning rate is lr and the current iteration step is iter, then for the first slow_iters steps the learning rate is lr \times \frac{iter}{slow\_iters}, and for the rest of the steps it is lr \times \big(1 - \frac{iter - slow\_iters}{max\_iter - slow\_iters}\big)^{power} with power = 0.9. Here max_iter is the maximum number of training steps. For the PASCAL VOC 2012 dataset, the model is trained on the trainaug [13] set, which contains 10 582 images; max_iter is about 30K, lr = 0.007, and slow_iters = 1.5K. For the CamVid dataset, we train the model on the train and validation sets; max_iter is about 6K, lr = 0.025, and slow_iters = 300.

Crop size and output stride. During training, the batch size is always 16. The crop size is 513 and 479 for the PASCAL VOC 2012 and CamVid datasets respectively. The output stride, which is the ratio of the input image spatial resolution to the final output resolution, is always 16 during training and inference. When calculating the loss, we upscale the logits (the output of the model before the softmax or sigmoid operations) back to the input image resolution rather than downsampling the ground truth [6].

Data augmentation. We apply data augmentation by randomly scaling the input images and randomly left-right flipping during training. The random scale is in [0.5, 0.75, 1.0, 1.25, 1.50, 1.75, 2.0] on the PASCAL VOC 2012 dataset and 0.75 ∼ 1.25 on the CamVid dataset. 
The data are then standardized to have zero mean and unit variance.

Inference strategy and evaluation metric. During inference, we use the original image as the input of the model, and no special inference strategy is applied. The evaluation metric is the mean intersection-over-union (mIoU) score. The remaining settings are the same as for DeepLabv3+ [7].

5.2 Methods of comparison

We compare RMI with CRF [19] and the affinity field loss [18] experimentally. Both methods also try to model the relationship between pixels in an image to get better segmentation results.

CRF. CRF [19] tries to model the relationship between pixels and enforces the predictions of pixels with similar visual appearances to be more consistent. Following DeepLabv2 [5], we use it as a post-processing step. The negative logarithm of the predicted probability is used as the unary potential. We reproduce the CRF according to its official code^3 and Python wrapper^4. The important parameters \theta_\alpha, \theta_\beta, and \theta_\gamma in [19, Eq. (3)] are set to 30, 13, and 3 respectively, as recommended; other settings are the defaults. As CRF has time-consuming inference steps, it is unacceptable in some situations, e.g., real-time application scenarios.

Affinity field loss. The affinity field loss [18] exploits the relationship between pairs of pixels. For paired neighbouring pixels which belong to the same class, the loss imposes a grouping force on the pixel pair to make their predictions more consistent. For paired neighbouring pixels which belong to different classes, the affinity field loss imposes a separating force on them to make their predictions more inconsistent. The affinity field loss adopts an 8-neighbour strategy, so it requires a large amount of memory to hold 8× the ground truth labels and predicted probabilities when calculating its value. To overcome this problem, Ke et al. 
[18] downsample the label when calculating the loss. However, this may hurt the performance of the model [6]. We reproduce the loss according to its official implementation^5. When choosing the neighbouring pixels in a square region, we set the region size to 3 × 3 as suggested.

^2 https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/models/model_zoo.py
^3 http://www.philkr.net/2011/12/01/nips/
^4 https://github.com/lucasb-eyer/pydensecrf
^5 https://github.com/twke18/Adaptive_Affinity_Fields

Table 1: Evaluation of RMI on the PASCAL VOC 2012 val set. CE and BCE are the softmax and sigmoid cross entropy losses respectively. The rows labelled CE* give the results reported in [6, 7] with settings similar to ours (output stride = 16, only ImageNet pretraining, and no special inference strategies). CRF-X means that we run X iteration steps when employing CRF. Inf. Time is the average inference time per image during testing; as CRF is used as a post-processing step, the additional time is the same for all base models. Here we apply RMI after downsampling the prediction and ground truth through average pooling with kernel size 4 × 4 and stride 4. The square region size is 3 × 3, so the dimension of the multi-dimensional points is 9.

(a) ResNet101-DeepLabv3

Method | Inf. Time | mIoU (%)
CE* [6] | unknown | 77.21
CE | 0.12 s | 77.14
BCE | 0.12 s | 77.09
CE & CRF-1 | 0.39 s | 78.32
CE & CRF-5 | 0.71 s | 78.40
CE & CRF-10 | 1.11 s | 78.28
Affinity | 0.12 s | 76.24
RMI | 0.12 s | 78.71

(b) ResNet101-DeepLabv3+

Method | mIoU (%)
CE* [7] | 78.85
CE | 78.17
BCE | 77.41
CE & CRF-1 | 78.90
CE & CRF-5 | 78.75
CE & CRF-10 | 78.60
Affinity | 77.09
RMI | 79.66

Table 2: Per-class results on the PASCAL VOC 2012 test set. Per-class IoU scores are listed in the order backg., aero., bike, bird, boat, bottle, bus, car, cat, chair, cow, d.table, dog, horse, mbike, person, p.plant, sheep, sofa, train, tv.

DeepLabv3, CE: 94.10 79.58 41.16 84.67 67.68 75.09 87.69 87.40 92.07 39.66 83.39 69.68 86.67 87.10 86.92 84.39 65.69 86.66 57.39 75.28 75.94 | mIoU 76.58
DeepLabv3, CRF-5: 94.60 84.28 41.83 88.00 68.81 76.56 87.69 87.90 93.79 40.35 84.92 70.26 88.84 89.22 87.39 85.73 67.45 87.95 58.80 75.31 77.38 | mIoU 77.96
DeepLabv3, RMI: 94.57 84.77 41.67 89.99 69.11 77.86 90.02 90.17 93.14 42.97 85.70 64.74 87.45 86.63 88.25 87.04 68.78 90.42 59.13 79.67 78.05 | mIoU 78.58
DeepLabv3+, CE: 94.37 90.03 42.40 82.07 70.46 75.77 93.36 88.07 90.70 36.50 86.50 67.17 86.04 90.18 87.23 85.02 68.36 88.46 57.34 84.13 78.62 | mIoU 78.23
DeepLabv3+, CRF-1: 94.57 92.13 42.48 83.25 71.07 76.61 93.47 87.96 91.45 36.82 87.04 67.21 87.28 90.87 87.63 85.86 69.22 89.23 58.04 84.43 79.46 | mIoU 78.86
DeepLabv3+, RMI: 94.97 91.57 42.93 93.72 74.84 76.23 93.68 89.09 93.59 41.99 87.63 68.79 88.23 91.33 87.12 88.62 70.24 92.00 57.77 82.53 76.60 | mIoU 80.16

5.3 Results on the PASCAL VOC 2012 dataset

5.3.1 Effectiveness of RMI

RMI is first evaluated on the PASCAL VOC 2012 val set to demonstrate its effectiveness. The results are shown in Tab. 1. With DeepLabv3 and DeepLabv3+ as base models, RMI improves the mIoU score by 1.57% and 1.49% on the val set respectively. From Tab. 
1, we can also see that the reproduced models with RMI outperform the official models with settings similar to ours, and that the DeepLabv3 model with RMI matches the performance of the DeepLabv3+ model without RMI. RMI also achieves consistent improvements across different models, whereas the improvement from CRF decreases when the base model is more powerful.

In Tab. 1, RMI outperforms the CRF and the affinity field loss, while it has no extra inference steps. Since we downsample the ground truth and probability before constructing the multi-dimensional distributions, RMI only requires a small amount of additional memory. Roughly speaking, if we use double-precision floating-point numbers, it needs 2 \times \frac{3 \times 3}{4 \times 4} = 1.125 times the original memory usage. For a ground truth map with shape [16, 513, 513, 21], the additional memory is theoretically only about 0.37 GB. This is less than the memory that the affinity field loss consumes.

We evaluate some of the models in Tab. 1 on the PASCAL VOC 2012 test set, and the per-class results are shown in Tab. 2. With DeepLabv3 and DeepLabv3+, the improvements are 2.00% and 1.93% respectively. Models trained with RMI thus show an even better generalization ability on the test set than on the val set.

Some selected qualitative results on the val set are shown in Fig. 2. The segmentation results of DeepLabv3+ & RMI have more accurate boundaries and richer details than those of DeepLabv3+ & CE. This demonstrates that RMI can indeed capture the relationship between pixels in an image; thus the predictions of the model with RMI have better visual quality.

5.3.2 Ablation study

In this section, we study the influence of the downsampling manner, the downsampling factor (DF; the image size after downsampling is the original image size divided by DF), and the square region size (R × R) on the PASCAL VOC 2012 val set. The results are shown in Tab. 3.

In Tab. 
3a, RMI with average pooling gets the best performance; this may be because average pooling preserves the most information after downsampling, whereas max pooling and nearest interpolation both discard some points of the image. When DF increases, the performance of RMI with the same region size tends to suffer from the lack of image details: the larger the DF, the smaller the downsampled image, and the more image details are lost. Since semantic segmentation is a fine-grained classification task, a large DF is not a good choice.

In Tab. 3b, we show the influence of the square region size R × R, where R × R is also the dimension of the points in the multi-dimensional distribution. Here, RMI with a large R and a small DF is more likely to get better performance; however, the smaller the downsampling factor DF and the larger the side length R, the greater the computational resource consumption. Tab. 3b also shows that RMI with R > 1 gets better results than RMI with R = 1. This further demonstrates that RMI benefits from the relationship between pixels in an image.

There is a trade-off between the performance of RMI and its cost, i.e., GPU memory and floating-point operations. The performance of RMI can be further improved when computational resources are sufficient; e.g., DeepLabv3 with RMI (DF = 2, R = 3) in Tab. 3a achieves 79.23% on the PASCAL VOC 2012 val set, which outperforms the 78.85% of the improved DeepLabv3+ model reported in [7] with experimental settings similar to ours. Limited by our GPU memory, we did not test RMI without downsampling.

Table 3: Influence of different components of RMI. Table 3a shows the effect of the downsampling manner and factor: Avg. is average pooling, Max. is max pooling, and Int. is interpolation. 
Here we use nearest interpolation for the ground truth labels and bilinear interpolation for the predicted probabilities. Tab. 3b shows the impact of the square region size R × R.

(a) Downsampling manner and factor (DeepLabv3)

Method       DF   R   mIoU (%)
RMI - Int.    4   3   77.58
RMI - Max.    4   3   78.62
RMI - Avg.    4   3   78.71
RMI - Avg.    2   3   79.23
RMI - Avg.    3   3   79.01
RMI - Avg.    4   3   78.71
RMI - Avg.    5   3   78.50
RMI - Avg.    6   3   78.28

(b) Square region size (DeepLabv3)

Method       DF   R   mIoU (%)
RMI - Avg.    4   1   78.25
RMI - Avg.    4   2   78.52
RMI - Avg.    4   3   78.71
RMI - Avg.    4   4   78.65
RMI - Avg.    6   2   78.06
RMI - Avg.    6   3   78.28
RMI - Avg.    6   4   78.33
RMI - Avg.    6   5   78.26

5.4 Results on CamVid dataset

In this section, we evaluate RMI on the CamVid dataset to further demonstrate its general applicability. The results are shown in Tab. 4. With DeepLabv3 and DeepLabv3+ as base models, RMI improves the mIoU score on the CamVid test set by 2.61% and 2.24%, respectively.

From Tab. 4, we can see that the affinity field loss works better on the CamVid dataset than on PASCAL VOC 2012, while CRF behaves oppositely. This suggests that these two methods may not be widely applicable. In contrast, RMI achieves substantial improvements on the CamVid dataset as well as on PASCAL VOC 2012, which verifies the broad applicability of the RMI loss.

Table 4: Per-class results on CamVid test set.
When applying RMI, we employ DF = 4 and R = 3. We choose the best mIoU score from the results of CRF with 1, 5, and 10 inference steps.

Method        sky    building  pole   road   pavement  tree   sign symbol  fence  car    pedestrian  bicyclist  mIoU (%)
DeepLabv3
  CE          88.06  79.03     10.12  91.32  73.21     72.65  41.04        39.52  79.35  42.43       54.51      58.50
  CE & CRF    90.64  80.48      4.70  91.49  73.64     74.35  39.93        41.98  80.45  40.50       54.20      58.59
  Affinity    88.23  78.85     11.54  92.77  76.50     71.96  45.41        39.27  82.65  43.26       54.14      59.60
  RMI         89.29  79.89     11.64  92.04  75.62     73.39  48.02        43.25  82.01  46.74       59.67      61.11
DeepLabv3+
  CE          90.86  80.28     27.62  93.60  78.57     74.35  44.51        38.86  84.35  51.80       57.09      62.82
  CE & CRF    91.88  80.92     15.98  93.20  78.14     74.89  42.14        39.72  84.50  40.74       56.04      61.68
  Affinity    90.23  81.03     29.83  93.93  80.86     74.75  47.21        41.50  85.12  54.97       60.92      64.42
  RMI         90.91  80.50     28.68  94.16  81.28     73.77  52.08        42.72  83.22  57.75       62.57      65.06

6 Conclusion

In this work, we develop a region mutual information (RMI) loss to model the relationship between pixels in an image. RMI uses one pixel and its neighbour pixels to represent this pixel. In this way, we obtain a multi-dimensional point constructed from a small region of the image, and the image is cast into a multi-dimensional distribution of these high-dimensional points. The prediction of the segmentation model and the ground truth can then achieve high-order consistency by maximizing the mutual information (MI) between their multi-dimensional distributions. However, it is hard to calculate the value of the MI directly, so we derive a lower bound of the MI and maximize this lower bound to maximize the actual value of the MI. The idea of RMI is intuitive, and it is also easy to use, since it only requires a small amount of additional memory during the training stage.
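The region-representation step described above can be sketched in a few lines of PyTorch. This is an illustrative sketch under our default settings (average pooling, DF = 4, R = 3), not the released implementation; the function name `region_points` and its arguments are hypothetical.

```python
import torch
import torch.nn.functional as F

def region_points(prob, df=4, r=3):
    """Turn a probability map [N, C, H, W] into R*R-dimensional points.

    Each pixel of the downsampled map is represented together with its
    R x R neighbourhood, yielding one multi-dimensional point per region.
    Output shape: [N, C, R*R, num_points]. Hypothetical helper, for
    illustration only.
    """
    # Downsample by factor DF with average pooling (the downsampling
    # variant that performed best in the ablation study).
    prob = F.avg_pool2d(prob, kernel_size=df, stride=df)
    n, c, h, w = prob.shape
    # Extract every R x R region; each column of the unfolded tensor
    # is one multi-dimensional point.
    patches = F.unfold(prob.reshape(n * c, 1, h, w), kernel_size=r)
    return patches.reshape(n, c, r * r, -1)

# Example: a batch of 2 softmax maps with 21 classes on a 64 x 64 grid.
probs = torch.softmax(torch.randn(2, 21, 64, 64), dim=1)
points = region_points(probs, df=4, r=3)
print(points.shape)  # torch.Size([2, 21, 9, 196])
```

After pooling, the 64 × 64 map shrinks to 16 × 16, and unfolding with R = 3 yields (16 − 3 + 1)² = 196 points of dimension 9 per class; the MI lower bound would then be computed between the point sets of the prediction and the ground truth.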
Meanwhile, it needs no changes to the base segmentation model. We experimentally demonstrate that RMI achieves substantial and consistent improvements in performance on standard benchmark datasets.

Figure 2: Selected qualitative results on the PASCAL VOC 2012 val set, using the DeepLabv3+ models in Tab. 1. Segmentation results of DeepLabv3+&RMI have richer details than those of DeepLabv3+&CE, e.g., small bumps on the airplane wing, branches of plants, and limbs of cows and sheep. Columns from left to right: image, ground truth, DeepLabv3+&CE, DeepLabv3+&RMI. Best viewed in color with 300% zoom.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2018AAA0101400), and in part by the National Natural Science Foundation of China (Grant No. 61936006).

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In OSDI, 2016.

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. PAMI, 2017.

[3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2008.

[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 2018.

[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam.
Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.

[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.

[8] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.

[9] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2nd ed.). Wiley, 2006.

[10] Mark Everingham, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. IJCV, 2010.

[11] Jun Fu, Jing Liu, Yuhang Wang, and Hanqing Lu. Stacked deconvolutional network for semantic segmentation. CoRR, abs/1708.04943, 2017.

[12] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[13] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[15] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. CoRR, abs/1812.01187, 2018.

[16] Wassily Hoeffding and Herbert Robbins. The central limit theorem for dependent random variables. Duke Mathematical Journal, 1948.

[17] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 2000.

[18] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X. Yu. Adaptive affinity fields for semantic segmentation. In ECCV, 2018.

[19] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials.
In NeurIPS, 2011.

[20] Di Lin, Yuanfeng Ji, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Multi-scale context intertwining for semantic segmentation. In ECCV, 2018.

[21] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In NeurIPS, 2017.

[22] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning markov random field for semantic segmentation. PAMI, 2018.

[23] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[24] Frederik Maes, André Collignon, Dirk Vandermeulen, Guy Marchal, and Paul Suetens. Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging, 1997.

[25] Michael Maire, Takuya Narihira, and Stella X. Yu. Affinity CNN: learning pixel-centric pairwise relations for figure/ground embedding. In CVPR, 2016.

[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NeurIPS-W, 2017.

[27] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, Nov. 2012. Version 20121115.

[28] Josien P. W. Pluim, J. B. Antoine Maintz, and Max A. Viergever. Mutual information based registration of medical images: A survey. IEEE Trans. Med. Imaging, 2003.

[29] Sudhakar Prasad. Certain relations between mutual information and fidelity of statistical estimation. CoRR, abs/1010.1508, 2010.

[30] Daniel B. Russakoff, Carlo Tomasi, Torsten Rohlfing, and Calvin R. Maurer Jr. Image similarity using mutual information of regions. In ECCV, 2004.

[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li.
Imagenet large scale visual recognition challenge. IJCV, 2015.

[32] Falong Shen, Rui Gan, Shuicheng Yan, and Gang Zeng. Semantic segmentation via structured patch prediction, context crf and guidance crf. In CVPR, 2017.

[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[34] K. Triantafyllopoulos and P. J. Harrison. Posterior mean and variance approximation for regression and time series problems. Statistics, 2008.

[35] Max A. Viergever, J. B. Antoine Maintz, Stefan Klein, Keelin Murphy, Marius Staring, and Josien P. W. Pluim. A survey of medical image registration - under review. Medical Image Analysis, 2016.

[36] Paul A. Viola and William M. Wells III. Alignment by maximization of mutual information. IJCV, 1997.

[37] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

[38] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.

[39] Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.

[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.

[41] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks.
In ICCV, 2015.