{"title": "Self-Erasing Network for Integral Object Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 549, "page_last": 559, "abstract": "Recently, adversarial erasing for weakly-supervised object attention has been deeply studied due to its capability in localizing integral object regions. However, such a strategy raises one key problem that attention regions will gradually expand to non-object regions as training iterations continue, which significantly decreases the quality of the produced attention maps. To tackle such an issue as well as promote the quality of object attention, we introduce a simple yet effective Self-Erasing Network (SeeNet) to prohibit attentions from spreading to unexpected background regions. In particular, SeeNet leverages two self-erasing strategies to encourage networks to use reliable object and background cues for learning to attention. In this way, integral object regions can be effectively highlighted without including much more background regions. To test the quality of the generated attention maps, we employ the mined object regions as heuristic cues for learning semantic segmentation models. Experiments on Pascal VOC well demonstrate the superiority of our SeeNet over other state-of-the-art methods.", "full_text": "Self-Erasing Network for Integral Object Attention\n\nQibin Hou\n\nPeng-Tao Jiang\n\nCollege of Computer Science, Nankai University\n\nandrewhoux@gmail.com\n\nYunchao Wei\n\nUIUC\n\nUrbana-Champaign, IL, USA\n\nCollege of Computer Science, Nankai University\n\nMing-Ming Cheng \u2217\n\ncmm@nankai.edu.cn\n\nAbstract\n\nRecently, adversarial erasing for weakly-supervised object attention has been\ndeeply studied due to its capability in localizing integral object regions. 
However,\nsuch a strategy raises one key problem that attention regions will gradually expand\nto non-object regions as training iterations continue, which signi\ufb01cantly decreases\nthe quality of the produced attention maps. To tackle such an issue as well as\npromote the quality of object attention, we introduce a simple yet effective Self-\nErasing Network (SeeNet) to prohibit attentions from spreading to unexpected\nbackground regions. In particular, SeeNet leverages two self-erasing strategies\nto encourage networks to use reliable object and background cues for learning to\nattend. In this way, integral object regions can be effectively highlighted without\nincluding much more background regions. To test the quality of the generated\nattention maps, we employ the mined object regions as heuristic cues for learning\nsemantic segmentation models. Experiments on Pascal VOC well demonstrate the\nsuperiority of our SeeNet over other state-of-the-art methods.\n\n1 Introduction\n\nSemantic segmentation aims at assigning each pixel a label from a prede\ufb01ned label set given a\nscene. For fully-supervised semantic segmentation [21, 4, 40, 41, 20], the requirement of large-scale\npixel-level annotations considerably limits its generality [3]. Some weakly-supervised works attempt\nto leverage relatively weak supervisions, such as scribbles [19], bounding boxes [27], or points [1],\nbut they still need a large amount of manual labor. Therefore, semantic segmentation with image-level\nsupervision [25, 16, 35, 12, 34] is becoming a promising way to relieve much of this human labor. In this\npaper, we are also interested in the problem of weakly-supervised semantic segmentation. 
As only\nimage-level labels are available, most recent approaches [16, 3, 12, 34, 9], more or less, rely on\ndifferent attention models due to their ability to cover small but discriminative semantic regions.\nTherefore, how to generate high-quality attention maps is essential for offering reliable initial heuristic\ncues for training segmentation networks. Earlier weakly-supervised semantic segmentation methods\n[16, 34] mostly adopt the original Class Activation Maps (CAM) model [42] for object localization.\nFor small objects, CAM works well, but for large objects it can only\nlocalize small areas of discriminative regions, which is harmful for training segmentation networks in\nthat the undetected semantic regions will be judged as background.\n\n\u2217MM Cheng is the corresponding author of this paper. Project page: http://mmcheng.net/SeeNet/.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: (a) A typical adversarial erasing approach [39], which is composed of an initial attention\ngenerator and a complementary attention generator; (b-d) Attention maps produced by (a) as the\ntraining iterations increase; (e) The attention map generated by our approach. As can be seen, the\nattentions by (a) gradually appear in unexpected regions while our results are con\ufb01ned in the bicycle\nregion properly.\n\nInterestingly, the adversarial erasing strategy [34, 18] (Fig. 1) has been proposed recently. Bene\ufb01ting\nfrom the powerful localization ability of CNNs, this type of method is able to further discover\nmore object-related regions by erasing the detected regions. However, a key problem of this type of\nmethod is that as more semantic regions are mined, the attentions may spread to the background and,\nfurthermore, the localization ability of the initial attention generator is degraded. 
For example, trains\noften run on rails and hence, once trains are erased, rails may be classi\ufb01ed as the train category, which\nnegatively affects the learning of semantic segmentation networks.\nIn this paper, we propose a promising way to overcome the above-mentioned drawback of the\nadversarial erasing strategy by introducing the concept of self-erasing. The background regions of\ncommon scenes often share some similarities, which motivates us to explicitly feed attention networks\nwith a roughly accurate background prior to con\ufb01ne the observable regions in semantic \ufb01elds. To do\nso, we present two self-erasing strategies by leveraging the background prior to purposefully suppress\nthe spread of attentions to the background regions. Moreover, we design a new attention network that\ntakes the above self-erasing strategies into account to discover more high-quality attentions from a\npotential zone instead of the whole image [39]. We apply our attention maps to weakly-supervised\nsemantic segmentation, evaluate the segmentation results on the PASCAL VOC 2012 [6] benchmark,\nand show substantial improvements compared to existing methods.\n\n2 Related Work\n\n2.1 Attention Networks\n\nEarlier Work. To date, a great number of attention networks have been developed, attempting to\nreveal the working mechanism of CNNs. At an earlier stage, error back-propagation based methods\n[31, 37] were proposed for visualizing CNNs. CAM [42] adopted a global average pooling layer\nfollowed by a fully connected layer as a classi\ufb01er. Later, Selvaraju et al. proposed Grad-CAM, which\ncan be embedded into a variety of off-the-shelf networks for visualizing multiple tasks, such\nas image captioning and image classi\ufb01cation. Zhang et al. [38], motivated by humans\u2019 visual system,\nused the winner-take-all strategy to back-propagate discriminative signals in a top-down manner. 
A\nsimilar property shared by the above methods is that they only attempt to produce an attention map.\n\nAdversarial Erasing Strategy.\nIn [34], Wei et al. proposed the adversarial erasing strategy, which\naims at discovering more unseen semantic objects. A CAM branch is used to determine an initial\nattention map and then a threshold is used to selectively erase the discovered regions from the images.\nThe erased images are then sent into another CNN to further mine more discriminative regions. In\n[39, 18], Zhang et al. and Li et al. extended the initial adversarial erasing strategy in an end-to-end\ntraining manner.\n\n2.2 Weakly-Supervised Semantic Segmentation\n\nDue to the fact that collecting pixel-level annotations is very expensive, more and more works are\nrecently focusing on weakly-supervised semantic segmentation. Besides some works relying on\nrelatively strong supervisions, such as scribble [19], points [1], and bounding boxes [27], most\nweakly-supervised methods are based on only image-level labels or even inaccurate keywords [11].\nLimited by keyword-level supervision, many works [16, 34, 9, 12, 3, 28, 36] harnessed attention\nmodels [42, 38] for generating the initial seeds. Saliency cues [5, 2, 33, 10, 13] are also adopted by\nsome methods as the initial heuristic cues. Beyond that, there are also some works proposing different\nstrategies to solve this problem, such as multiple instance learning [26] and the EM algorithm [24].\n\nFigure 2: Illustrations explaining how to generate ternary masks. (a) Source images; (b) Initial\nattention maps produced by an initial attention generator; (c) Ternary masks after thresholding. Given\n(b), we separate each of them into three zones by setting two thresholds. The yellow zone in (c)\ncorresponds to larger attention values in (b). The dark zone corresponding to lower values is explicitly\nde\ufb01ned as background priors. The middle zone contains semantic objects with high probability. 
Note\nthat the images in (c) are for illustration only; the actual masks are ternary.\n\n3 Self-Erasing Network\n\nIn this section, we describe details of the proposed Self-Erasing Network (SeeNet). An overview of\nour SeeNet can be found in Fig. 3. Before the formal description, we \ufb01rst introduce the intuition of\nour proposed approach.\n\n3.1 Observations\n\nAs stated in Sec. 1, with the increase of training iterations, the adversarial erasing strategy tends to mine\nmore areas not belonging to any semantic objects at all. Thus, it is dif\ufb01cult to determine when the\ntraining phase should be ended. An illustration of this phenomenon is given in Fig. 1. In\nfact, we humans always \u2018deliberately\u2019 suppress the areas that we are not interested in so as to better\nfocus our attention [17]. When looking at a large object, we often seek the most distinctive parts\nof the object \ufb01rst and then move the eyes to the other parts. In this process, humans are always able to\ninadvertently and successfully neglect the distractions brought by the background. However, attention\nnetworks themselves do not possess such capability with only image-level labels given. Therefore,\nhow to explicitly introduce a background prior into attention networks is essential. 
Inspired by this\ncognitive process of humans, rather than simply erasing the attention regions with higher con\ufb01dence\nas done in existing works [39, 18, 34], we propose to explicitly tell CNNs where the background is\nso as to let attention networks better focus on discovering real semantic objects.\n\n3.2 The Idea of Self-Erasing\n\nTo highlight the semantic regions and keep the detected attention areas from expanding to background\nareas, we propose the idea of self-erasing during training. Given an initial attention map (produced by\nSA in Fig. 3), we functionally separate the images into three zones in the spatial dimension: the internal\n\u201cattention zone\u201d, the external \u201cbackground zone\u201d, and the middle \u201cpotential zone\u201d (Fig. 2c). By\nintroducing the background prior, we aim to drive attention networks into a self-erasing state so that\nthe observable regions can be restricted to non-background areas, avoiding the continuous spread of\nattention areas that are already near a state of perfection. To achieve this goal, we need to solve the\nfollowing two problems: (I) Given only image-level labels, how to de\ufb01ne and obtain the background\nzone. (II) How to introduce the self-erasing idea into attention networks.\n\nFigure 3: Overview of the proposed approach.\n\nBackground priors. Under weak supervision, it is quite dif\ufb01cult to obtain\na precise background zone, so we settle for a more modest objective:\nobtaining relatively accurate background priors. Given the initial attention map MA, in addition\nto thresholding MA with \u03b4 for a binary mask BA as in [39], we also consider using another constant\nwhich is less than \u03b4 to get a ternary mask TA. For notational convenience, we here use \u03b4h and \u03b4l\n(\u03b4h > \u03b4l) to denote the two thresholds. Regions with values less than \u03b4l in MA will all be treated as\nthe background zone. 
Thus, we de\ufb01ne our ternary mask TA as follows: TA,(i,j) = 0 if MA,(i,j) \u2265 \u03b4h,\nTA,(i,j) = \u22121 if MA,(i,j) < \u03b4l, and TA,(i,j) = 1 otherwise. This means the background zone is\nassociated with a value of \u22121 in TA. We empirically found that the resulting background zone covers\nmost of the real background areas for almost all input scenes. This is reasonable as SA is already\nable to locate parts of the semantic objects.\n\nConditionally Reversed Linear Units (C-ReLUs). With the background priors, we introduce the\nself-erasing strategies by reversing the signs of the feature maps corresponding to the background\noutput by the backbone to make the potential zone stand out. To do so, we extend the ReLU\nlayer [22] to a more general case. Recall that the ReLU function, according to its de\ufb01nition, can be\nexpressed as ReLU(x) = max(0, x). More generally, our C-ReLU function takes a binary mask into\naccount and is de\ufb01ned as\n\nC-ReLU(x) = max(x, 0) \u00d7 B(x),    (1)\n\nwhere B is a binary mask, taking values from {\u22121, 1}. Unlike ReLUs outputting tensors with only\nnon-negative values, our C-ReLUs conditionally \ufb02ip the signs of some units according to a given\nmask. We expect that the attention networks can focus more on the regions with positive activations\nafter C-ReLU and further discover more semantic objects from the potential zone because of the\ncontrast between the potential zone and the background zone.\n\n3.3 Self-Erasing Network\n\nOur architecture is composed of three branches after a shared backbone, denoted by SA, SB, and\nSC, respectively. Fig. 3 illustrates the overview of our proposed approach. Our SA\nhas a structure similar to that of [39], and its goal is to determine an initial attention map. SB and SC have\nstructures similar to SA, except that a C-ReLU layer is inserted before each of them.\nSelf-erasing strategy I. 
By adding the second branch SB, we introduce the \ufb01rst self-erasing strategy.\nGiven the attention map MA produced by SA, we can obtain a ternary mask TA according to Sec. 3.2.\nWhen sending TA to the C-ReLU layer of SB, we can easily adjust TA to a binary mask by setting\nnon-negative values to 1. When taking the erasing strategy into account, we can extend the binary\nmask in the C-ReLU function to a ternary case. Thus, Eqn. (1) can be rewritten as\n\nC-ReLU(x) = max(x, 0) \u00d7 TA(x).    (2)\n\nA visual illustration of Eqn. (2) is given in Fig. 2c. The zone highlighted in yellow\ncorresponds to attentions detected by SA, which will be erased in the output of the backbone. Units\nwith positive values in the background zone will be reversed to highlight the potential zone. During\ntraining, SB will fall into a state of self-erasing, deterring the background stuff from being discovered\nand meanwhile ensuring that the potential zone remains distinctive.\nSelf-erasing strategy II. This strategy aims at further avoiding attentions appearing in the background\nzone by introducing another branch SC. Speci\ufb01cally, we \ufb01rst transform TA to a binary mask by\nsetting regions corresponding to the background zone to 1 and the remaining regions to 0. In this way,\nonly the background zone of the output of the C-ReLU layer has non-zero activations. During the\ntraining phase, we let the probability of the background zone belonging to any semantic class learn\nto be 0. 
Because of the background similarities among different images, this branch will help correct\nthe wrongly predicted attentions in the background zone and indirectly avoid the wrong spread of\nattentions.\nThe overall loss function of our approach can be written as: L = LSA + LSB + LSC. For all branches,\nwe treat the multi-label classi\ufb01cation problem as M independent binary classi\ufb01cation problems by\nusing the cross-entropy loss, where M is the number of semantic categories. Therefore, given an\nimage I and its semantic labels y, the label vector for SA and SB is ln = 1 if n \u2208 y and 0 otherwise,\nwhere |l| = M. The label vector of SC is a zero vector, meaning that no semantic objects exist in the\nbackground zone.\n\nAlgorithm 1: \u201cProxy ground-truth\u201d for training semantic segmentation networks\nInput: Image I with N pixels; image labels y\nOutput: Proxy ground-truth G\nQ = zeros(M + 1, N), where N is the number of pixels and M is the number of semantic classes;\nD = Saliency(I);  \u21d0 obtain the saliency map\nAy = SeeNet(I, y);  \u21d0 generate attention maps\nfor each pixel i \u2208 I do\n    Q(0, i) \u2190 1 \u2212 D(i);  \u21d0 probability of position i being background\n    for each label c \u2208 y do\n        Q(c, i) \u2190 harm(D(i), Ac(i));  \u21d0 harmonic mean\n    end\nend\nG \u2190 argmax over l \u2208 {0, y} of Q;\n\nTo obtain the \ufb01nal attention maps, during the test phase, we discard the SC branch. Let MB be\nthe attention map produced by SB. We \ufb01rst normalize both MA and MB to the range [0, 1] and\ndenote the results as \u02c6MA and \u02c6MB. Then, the fused attention map MF is calculated by MF,i =\nmax( \u02c6MA,i, \u02c6MB,i). 
To obtain the \ufb01nal attention map, during the test phase, we also horizontally \ufb02ip\nthe input images and get another fused attention map MH. Therefore, our \ufb01nal attention map Mfinal\ncan be computed by Mfinal,i = max(MF,i, MH,i).\n\n4 Weakly-Supervised Semantic Segmentation\n\nTo test the quality of our proposed attention network, we applied the generated attention maps\nto the recently popular weakly-supervised semantic segmentation task. To compare with existing\nstate-of-the-art approaches, we follow a recent work [3], which leverages both saliency maps and\nattention maps. Instead of applying an erasing strategy to mine more salient regions, we simply use a\npopular salient object detection model [10] to extract the background prior by setting a hard threshold\nas in [18]. Speci\ufb01cally, given an input image I, we \ufb01rst simply normalize its saliency map to obtain\nD, which takes values in [0, 1]. Let y be the image-level label set of I taking values from {1, 2, . . . , M},\nwhere M is the number of semantic classes, and Ac be the attention map associated with label\nc \u2208 y. We can calculate our \u201cproxy ground-truth\u201d according to Algorithm 1. Following [3], here we\nharness the following harmonic mean function to compute the probability of pixel Ii belonging to\nclass c:\n\nharm(i) = (w + 1) / (w/Ac(i) + 1/D(i)).    (3)\n\nParameter w here is used to control the importance of attention maps. In our experiments, we set w\nto 1.\n\n5 Experiments\n\nTo verify the effectiveness of our proposed self-erasing strategies, we apply our attention network\nto the weakly-supervised semantic segmentation task as an example application. We show that by\nembedding our attention results into a simple approach, our semantic segmentation results outperform\nthe existing state-of-the-art methods.\n\nFigure 4: Visual comparisons among results by different network settings. 
(a) Source images; (b)\nAttention maps produced by our SeeNet; (c) Attention maps produced by ACoL [39]; (d) Attention\nmaps produced by setting 2 in Sec. 5.2. The top two rows show results with small objects while\nthe bottom two rows show results with large objects. As can be seen, our approach is able to\nsuppress the expansion of attentions to background regions well and meanwhile generate relatively integral\nresults compared to the other two settings.\n\n5.1 Implementation Details\n\nDatasets and evaluation metrics. We evaluate our approach on the PASCAL VOC 2012 image\nsegmentation benchmark [6], which contains 20 semantic classes plus the background category. As\ndone in most previous works, we train our model for both the attention and segmentation tasks on the\ntraining set, which consists of 10,582 images, including the augmented training set provided by [7].\nWe compare our results with other works on both the validation and test sets, which have 1,449 and\n1,456 images, respectively. Similarly to previous works, we use the mean intersection-over-union\n(mIoU) as our evaluation metric.\nNetwork settings. For our attention network, we use VGGNet [32] as our base model as done in\n[39, 34]. We discard the last three fully-connected layers and connect three convolutional layers with\n512 channels and kernel size 3 to the backbone as in [39]. Then, a 20-channel convolutional layer\nfollowed by a global average pooling layer is used to predict the probability of each category as done\nin [39]. We set the batch size to 16, weight decay to 0.0002, and learning rate to 0.001, divided by 10 after\n15,000 iterations. We run our network for a total of 25,000 iterations. For data augmentation, we follow\nthe strategy used in [8]. Thresholds \u03b4h and \u03b4l in SB are set to 0.7 and 0.05 times the maximum\nvalue of the attention map input to the C-ReLU layer, respectively. For the threshold used in SC, the\nfactor is set to (\u03b4h + \u03b4l)/2. 
For the segmentation task, to fairly compare with other works, we adopt\nthe standard Deeplab-LargeFOV architecture [4] as our segmentation network, which is based on\nthe VGGNet [32] pre-trained on the ImageNet dataset [29]. Similarly to [3], we also try the ResNet\nversion [8] of the Deeplab-LargeFOV architecture and report the results of both versions. The network and\nconditional random \ufb01elds (CRFs) hyper-parameters are the same as in [4].\nInference. For our attention network, we resize the input images to a \ufb01xed size of 224 \u00d7 224 and then\nresize the resulting attention map back to the original resolution. For the segmentation task, following\n[20], we perform multi-scale testing. For CRFs, we adopt the same code as in [4].\n\n5.2 The Role of Self-Erasing\n\nTo show the importance of our self-erasing strategies, we perform several ablation experiments\nin this subsection. Besides showing the results of our standard SeeNet (Fig. 3), we also consider\nimplementing another two network architectures and report the results. First, we re-implement the\nsimple erasing network (ACoL) proposed in [39] (setting 1). The hyper-parameters are all the same as\nthe default ones in [39]. This architecture uses neither our C-ReLU layer nor our SC\nbranch. Furthermore, to stress the importance of the conditional sign-\ufb02ipping operation, we\nalso try to zero the feature units associated with the background regions and keep all other settings\nunchanged (setting 2).\n\nSettings | Training set | Supervision | mIoU (val)\n1 (ACoL [39]) | 10,582 VOC | weakly | 56.1%\n2 (w/o sign-\ufb02ipping in C-ReLU) | 10,582 VOC | weakly | 55.8%\n3 (Ours) | 10,582 VOC | weakly | 57.3%\n\nTable 1: Quantitative comparisons with another two settings described in Sec. 5.2 on the validation\nset of PASCAL VOC 2012 segmentation benchmark [6]. The segmentation maps in this table are\ndirectly generated by segmentation networks without multi-scale test for fair comparisons. 
CRFs are\nnot used here either.\n\nThe quality of attention maps. In Fig. 4, we sample some images from the PASCAL VOC 2012\ndataset and show the results by different experiment settings. When localizing small objects as shown\non the top two rows of Fig. 4, our attention network is able to better focus on the semantic objects\ncompared to the other two settings. This is due to the fact that our SC branch helps better recognize\nthe background regions and hence improves the ability of our approach to keep the attentions from\nexpanding to unexpected non-object regions. When encountering large objects as shown on the\nbottom two rows of Fig. 4, in addition to discovering where the semantic objects are, our approach is\nalso capable of mining relatively integral objects compared to the other settings. The conditional\nreversion operations also protect the attention areas from spreading to the background areas. This\nphenomenon is especially clear in the monitor image of Fig. 4.\nQuantitative results on PASCAL VOC 2012. Besides visual comparisons, we also report\nthe results of applying our attention maps to the weakly-supervised semantic segmentation task.\nGiven the attention maps, we \ufb01rst carry out a series of operations following the instructions described\nin Sec. 4, yielding the proxy ground truths of the training set. We utilize the resulting proxy ground\ntruths as supervision to train the segmentation network. The quantitative results on the validation\nset are listed in Table 1. Note that the segmentation maps are all based on single-scale test and no\npost-processing tools, such as CRFs, are used. According to Table 1, one can observe that with the\nsame saliency maps as background priors, our approach achieves the best results. 
Compared to the\napproach proposed in [39], we have a performance gain of 1.2% in terms of mIoU score, which\nre\ufb02ects the high quality of the attention maps produced by our approach.\n\n5.3 Comparison with the State of the Art\n\nIn this subsection, we compare our proposed approach with existing weakly-supervised semantic\nsegmentation methods that are based on image-level supervision. Detailed information for each\nmethod is shown in Table 2. We report the results of each method on both the validation and test sets.\nFrom Table 2, we can observe that our approach greatly outperforms all other methods when the\nsame base model, such as VGGNet [32], is used. Compared to DCSP [3], which leverages the same\nprocedures to produce the proxy ground-truths for the segmentation network, we achieve a\nperformance gain of more than 2% on the validation set. This method uses the original CAM [42] as\nits attention map generator while our approach utilizes the attention maps produced by our SeeNet,\nwhich indirectly demonstrates the better performance of our attention network compared to CAM. Our\nsegmentation results are also much better than those of adversarial erasing methods, such as AE-PSL [34] and GAIN\n[18]. This also re\ufb02ects the high quality of\nour attention maps.\n\n5.4 Discussions\n\nTo better understand the proposed network, we show some visual results produced by our segmentation\nnetwork in Fig. 6. As can be seen, our segmentation network works well because of the high-quality\nattention maps produced by our SeeNet. However, despite the good results, there are still a small\nnumber of failure cases, some of which are shown on the bottom row of Fig. 6. 
These bad cases\nare often caused by the fact that semantic objects with different labels are frequently tied together,\nmaking it dif\ufb01cult for the attention models to precisely separate them.\n\nMethods | Publication | Supervision | mIoU (val) w/o CRF | mIoU (val) w/ CRF | mIoU (test) w/ CRF\nCCNN [25] | ICCV\u201915 | 10K weak | 33.3% | 35.3% | -\nEM-Adapt [24] | ICCV\u201915 | 10K weak | - | 38.2% | 39.6%\nMIL [26] | CVPR\u201915 | 700K weak | 42.0% | - | -\nDCSM [30] | ECCV\u201916 | 10K weak | - | 44.1% | 45.1%\nSEC [16] | ECCV\u201916 | 10K weak | 44.3% | 50.7% | 51.7%\nAugFeed [27] | ECCV\u201916 | 10K weak + bbox | 50.4% | 54.3% | 55.5%\nSTC [35] | PAMI\u201916 | 10K weak + sal | - | 49.8% | 51.2%\nRoy et al. [28] | CVPR\u201917 | 10K weak | - | 52.8% | 53.7%\nOh et al. [23] | CVPR\u201917 | 10K weak + sal | 51.2% | 55.7% | 56.7%\nAE-PSL [34] | CVPR\u201917 | 10K weak + sal | - | 55.0% | 55.7%\nHong et al. [9] | CVPR\u201917 | 10K + video weak | - | 58.1% | 58.7%\nWebS-i2 [14] | CVPR\u201917 | 19K weak | - | 53.4% | 55.3%\nDCSP-VGG16 [3] | BMVC\u201917 | 10K weak + sal | 56.5% | 58.6% | 59.2%\nDCSP-ResNet101 [3] | BMVC\u201917 | 10K weak + sal | 59.5% | 60.8% | 61.9%\nTPL [15] | ICCV\u201917 | 10K weak | - | 53.1% | 53.8%\nGAIN [18] | CVPR\u201918 | 10K weak + sal | - | 55.3% | 56.8%\nSeeNet (Ours, VGG16) | - | 10K weak + sal | 59.9% | 61.1% | 60.7%\nSeeNet (Ours, ResNet101) | - | 10K weak + sal | 62.6% | 63.1% | 62.8%\n\nTable 2: Quantitative comparisons with the existing state-of-the-art approaches on both validation\nand test sets. The word \u2018weak\u2019 here means supervision with only image-level labels. \u2018bbox\u2019 and \u2018sal\u2019\nmean that either bounding boxes or saliency maps are used. Unless otherwise stated, the methods\nlisted here are based on VGGNet [32] and the Deeplab-LargeFOV framework.\n\nFigure 5: More visual results produced by our approach.\n\n
Speci\ufb01cally, as attention models are\ntrained with only image-level labels, it is hard to capture perfectly integral objects. In Fig. 5, we show\nmore visual results sampled from the Pascal VOC dataset. As can be seen, some scenes have\ncomplex backgrounds or low contrast between the semantic objects and the background. Although our\napproach involves background priors to help con\ufb01ne the attention regions, when processing these\nkinds of images it is hard to localize the whole objects, and the quality of the initial attention maps is\nalso essential. In addition, it is still dif\ufb01cult to deal with images with multiple semantic objects as\nshown in the \ufb01rst row of Fig. 5. The attention networks may easily predict which categories exist in\nthe images but localizing all the semantic objects is not easy. A promising way to solve this problem\nmight be incorporating a small number of pixel-level annotations for each category during the training\nphase to offer attention networks the information of boundaries. The pixel-level information will\ntell attention networks where the boundaries of semantic objects are and also accurate background\nregions that will help produce more integral results. This is also the future work that we will aim at.\n\nFigure 6: Segmentation results produced by our approach. (a) Source images. (b) Ground-truth\nannotations. (c) Our segmentation results. 
In addition to good examples (the top three rows), we also\nlist a couple of bad cases (the bottom row) to help readers better understand our work.\n\n6 Conclusion\n\nIn this paper, we introduce the idea of self-erasing into attention networks. We propose to extract\nbackground information based on the initial attention maps produced by the initial attention generator\nby thresholding the maps into three zones. Given the roughly accurate background priors, we design\ntwo self-erasing strategies, both of which aim at prohibiting the attention regions from spreading\nto unexpected regions. Based on the two self-erasing strategies, we build a self-erasing attention\nnetwork to con\ufb01ne the observable regions in a potential zone that contains semantic objects with\nhigh probability. To evaluate the quality of the resulting attention maps, we apply them to the\nweakly-supervised semantic segmentation task by simply combining them with saliency maps. We\nshow that the segmentation results based on our proxy ground-truths greatly outperform existing\nstate-of-the-art results.\n\nAcknowledgments\n\nThis research was supported by NSFC (NO. 61620106008, 61572264), the national youth talent\nsupport program, Tianjin Natural Science Foundation for Distinguished Young Scholars (NO.\n17JCJQJC43700), and Huawei Innovation Research Program.\n\nReferences\n[1] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What\u2019s the point: Semantic segmenta-\n\ntion with point supervision. In ECCV, pages 549\u2013565, 2016.\n\n[2] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE TIP,\n\n24(12):5706\u20135722, 2015.\n\n[3] Arslan Chaudhry, Puneet K Dokania, and Philip HS Torr. Discovering class-speci\ufb01c pixels for weakly-\n\nsupervised semantic segmentation. BMVC, 2017.\n\n[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 
Deeplab:\nSemantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.\nIEEE TPAMI, 2017.\n\n[5] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu. Global contrast\n\nbased salient region detection. IEEE TPAMI, 37(3):569\u2013582, 2015.\n\n[6] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew\n\nZisserman. 
The Pascal visual object classes challenge: A retrospective. IJCV, 2015.\n\n[7] Bharath Hariharan, Pablo Arbel\u00e1ez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.\n\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[9] Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee, and Bohyung Han. Weakly supervised semantic segmentation using web-crawled videos. In CVPR, pages 3626\u20133635, 2017.\n\n[10] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. Deeply supervised salient object detection with short connections. IEEE TPAMI, 2018.\n\n[11] Qibin Hou, Ming-Ming Cheng, Jiangjiang Liu, and Philip HS Torr. WebSeg: Learning semantic segmentation from web searches. arXiv preprint arXiv:1803.09859, 2018.\n\n[12] Qibin Hou, Puneet Kumar Dokania, Daniela Massiceti, Yunchao Wei, Ming-Ming Cheng, and Philip Torr. Bottom-up top-down cues for weakly-supervised semantic segmentation. In EMMCVPR, 2017.\n\n[13] Huaizu Jiang, Ming-Ming Cheng, Shi-Jie Li, Ali Borji, and Jingdong Wang. Joint salient object detection and existence prediction. Front. Comput. Sci., 2018.\n\n[14] Bin Jin, Maria V Ortiz Segovia, and Sabine Susstrunk. Webly supervised semantic segmentation. In CVPR, pages 3626\u20133635, 2017.\n\n[15] Dahun Kim, Donggeun Yoo, In So Kweon, et al. Two-phase learning for weakly supervised object localization. In ICCV, 2017.\n\n[16] Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016.\n\n[17] Fei Fei Li, Rufin VanRullen, Christof Koch, and Pietro Perona. Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences, 99(14):9596\u20139601, 2002.\n\n[18] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. 
Tell me where to look: Guided attention inference network. In CVPR, 2018.\n\n[19] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.\n\n[20] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In CVPR, 2017.\n\n[21] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.\n\n[22] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807\u2013814, 2010.\n\n[23] Seong Joon Oh, Rodrigo Benenson, Anna Khoreva, Zeynep Akata, Mario Fritz, and Bernt Schiele. Exploiting saliency for object segmentation from image level labels. In CVPR, 2017.\n\n[24] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.\n\n[25] Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.\n\n[26] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.\n\n[27] Xiaojuan Qi, Zhengzhe Liu, Jianping Shi, Hengshuang Zhao, and Jiaya Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, 2016.\n\n[28] Anirban Roy and Sinisa Todorovic. Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In CVPR, 2017.\n\n[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.\n\n[30] Wataru Shimoda and Keiji Yanai. 
Distinct class-specific saliency maps for weakly supervised semantic segmentation. In ECCV, pages 218\u2013234, 2016.\n\n[31] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLRW, 2014.\n\n[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n\n[33] Jingdong Wang, Huaizu Jiang, Zejian Yuan, Ming-Ming Cheng, Xiaowei Hu, and Nanning Zheng. Salient object detection: A discriminative regional feature integration approach. IJCV, 2017.\n\n[34] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In CVPR, 2017.\n\n[35] Yunchao Wei, Xiaodan Liang, Yunpeng Chen, Xiaohui Shen, Ming-Ming Cheng, Jiashi Feng, Yao Zhao, and Shuicheng Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. IEEE TPAMI, 2016.\n\n[36] Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S Huang. Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation. In CVPR, 2018.\n\n[37] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818\u2013833, 2014.\n\n[38] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In ECCV, 2016.\n\n[39] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018.\n\n[40] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. 
In CVPR, 2017.\n\n[41] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.\n\n[42] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.", "award": [], "sourceid": 332, "authors": [{"given_name": "Qibin", "family_name": "Hou", "institution": "Nankai University"}, {"given_name": "PengTao", "family_name": "Jiang", "institution": "Nankai University"}, {"given_name": "Yunchao", "family_name": "Wei", "institution": "UIUC"}, {"given_name": "Ming-Ming", "family_name": "Cheng", "institution": "Nankai University"}]}