{"title": "Learning to Segment Object Candidates", "book": "Advances in Neural Information Processing Systems", "page_first": 1990, "page_last": 1998, "abstract": "Recent object detection systems rely on two critical steps: (1) a set of object proposals is predicted as efficiently as possible, and (2) this set of candidate proposals is then passed to an object classifier. Such approaches have been shown they can be fast, while achieving the state of the art in detection performance. In this paper, we propose a new way to generate object proposals, introducing an approach based on a discriminative convolutional network. Our model is trained jointly with two objectives: given an image patch, the first part of the system outputs a class-agnostic segmentation mask, while the second part of the system outputs the likelihood of the patch being centered on a full object. At test time, the model is efficiently applied on the whole test image and generates a set of segmentation masks, each of them being assigned with a corresponding object likelihood score. We show that our model yields significant improvements over state-of-the-art object proposal algorithms. In particular, compared to previous approaches, our model obtains substantially higher object recall using fewer proposals. We also show that our model is able to generalize to unseen categories it has not seen during training. Unlike all previous approaches for generating object masks, we do not rely on edges, superpixels, or any other form of low-level segmentation.", "full_text": "Learning to Segment Object Candidates\n\nPedro O. Pinheiro\u2217\n\npedro@opinheiro.com\n\nRonan Collobert\nlocronan@fb.com\n\nPiotr Doll\u00b4ar\n\npdollar@fb.com\n\nFacebook AI Research\n\nAbstract\n\nRecent object detection systems rely on two critical steps: (1) a set of object pro-\nposals is predicted as ef\ufb01ciently as possible, and (2) this set of candidate proposals\nis then passed to an object classi\ufb01er. 
Such approaches have been shown they can\nbe fast, while achieving the state of the art in detection performance. In this pa-\nper, we propose a new way to generate object proposals, introducing an approach\nbased on a discriminative convolutional network. Our model is trained jointly\nwith two objectives: given an image patch, the \ufb01rst part of the system outputs a\nclass-agnostic segmentation mask, while the second part of the system outputs the\nlikelihood of the patch being centered on a full object. At test time, the model\nis ef\ufb01ciently applied on the whole test image and generates a set of segmenta-\ntion masks, each of them being assigned with a corresponding object likelihood\nscore. We show that our model yields signi\ufb01cant improvements over state-of-the-\nart object proposal algorithms. In particular, compared to previous approaches,\nour model obtains substantially higher object recall using fewer proposals. We\nalso show that our model is able to generalize to unseen categories it has not seen\nduring training. Unlike all previous approaches for generating object masks, we\ndo not rely on edges, superpixels, or any other form of low-level segmentation.\n\n1\n\nIntroduction\n\nObject detection is one of the most foundational tasks in computer vision [21]. Until recently, the\ndominant paradigm in object detection was the sliding window framework: a classi\ufb01er is applied at\nevery object location and scale [4, 8, 32]. More recently, Girshick et al. [10] proposed a two-phase\napproach. First, a rich set of object proposals (i.e., a set of image regions which are likely to contain\nan object) is generated using a fast (but possibly imprecise) algorithm. Second, a convolutional\nneural network classi\ufb01er is applied on each of the proposals. This approach provides a notable gain\nin object detection accuracy compared to classic sliding window approaches. 
Since then, most state-\nof-the-art object detectors for both the PASCAL VOC [7] and ImageNet [5] datasets rely on object\nproposals as a \ufb01rst preprocessing step [10, 15, 33].\nObject proposal algorithms aim to \ufb01nd diverse regions in an image which are likely to contain\nobjects. For ef\ufb01ciency and detection performance reasons, an ideal proposal method should possess\nthree key characteristics: (i) high recall (i.e., the proposed regions should contain the maximum\nnumber of possible objects), (ii) the high recall should be achieved with the minimum number of\nregions possible, and (iii) the proposed regions should match the objects as accurately as possible.\nIn this paper, we present an object proposal algorithm based on Convolutional Networks (ConvNets)\n[20] that satis\ufb01es these constraints better than existing approaches. ConvNets are an important\nclass of algorithms which have been shown to be state of the art in many large scale object\nrecognition tasks. They can be seen as a hierarchy of trainable \ufb01lters, interleaved with non-linearities\nand pooling. ConvNets saw a resurgence after Krizhevsky et al. [18] demonstrated that they perform\nvery well on the ImageNet classi\ufb01cation benchmark. Moreover, these models learn suf\ufb01ciently\ngeneral image features, which can be transferred to many different tasks [10, 11, 3, 22, 23].\n\n\u2217Pedro O. Pinheiro is with the Idiap Research Institute in Martigny, Switzerland and Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL) in Lausanne, Switzerland. This work was done during an internship at FAIR.\n\nGiven an input image patch, our algorithm generates a class-agnostic mask and an associated score\nwhich estimates the likelihood of the patch fully containing a centered object (without any notion\nof an object category). The core of our model is a ConvNet which jointly predicts the mask and the\nobject score. 
A large part of the network is shared between those two tasks: only the last few network\nlayers are specialized for separately outputting a mask and score prediction. The model is trained by\noptimizing a cost function that targets both tasks simultaneously. We train on MS COCO [21] and\nevaluate the model on two object detection datasets, PASCAL VOC [7] and MS COCO.\nBy leveraging powerful ConvNet feature representations trained on ImageNet and adapted on the\nlarge amount of segmented training data available in COCO, we are able to beat the state of the art\nin object proposal generation under multiple scenarios. Our most notable achievement is that our\napproach beats other methods by a large margin while considering a smaller number of proposals.\nMoreover, we demonstrate the generalization capabilities of our model by testing it on object categories\nnot seen during training. Finally, unlike all previous approaches for generating segmentation\nproposals, we do not rely on edges, superpixels, or any other form of low-level segmentation. Our\napproach is the \ufb01rst to learn to generate segmentation proposals directly from raw image data.\nThe paper is organized as follows: \u00a72 presents related work, \u00a73 describes our architecture choices,\nand \u00a74 describes our experiments on different datasets. We conclude in \u00a75.\n\n2 Related Work\n\nIn recent years, ConvNets have been widely used in the context of object recognition. Notable\nsystems are AlexNet [18] and more recently GoogLeNet [29] and VGG [27], which perform exceptionally\nwell on ImageNet. In the setting of object detection, Girshick et al. [10] proposed\nR-CNN, a ConvNet-based model that outperforms models relying on hand-designed\nfeatures by a large margin. Their approach can be divided into two steps: selection of a set of salient object proposals\n[31], followed by a ConvNet classi\ufb01er [18, 27]. 
Currently, most state-of-the-art object detection\napproaches [30, 12, 9, 25] rely on this pipeline. Although they are slightly different in the classi\ufb01cation\nstep, they all share the \ufb01rst step, which consists of choosing a rich set of object proposals.\nMost object proposal approaches leverage low-level grouping and saliency cues. These approaches\nusually fall into three categories: (1) objectness scoring [1, 34], in which proposals are extracted by\nmeasuring the objectness score of bounding boxes, (2) seed segmentation [14, 16, 17], where models\nstart with multiple seed regions and generate separate foreground-background segmentation for each\nseed, and (3) superpixel merging [31, 24], where multiple over-segmentations are merged according\nto various heuristics. These models vary in terms of the type of proposal generated (bounding boxes\nor segmentation masks) and whether the proposals are ranked. For a more complete overview of object\nproposal methods, we recommend the recent survey from Hosang et al. [13].\nAlthough our model shares high level similarities with these approaches (we generate a set of ranked\nsegmentation proposals), these results are achieved quite differently. All previous approaches for\ngenerating segmentation masks, including [17] which has a learning component, rely on low-level\nsegmentations such as superpixels or edges. Instead, we propose a data-driven discriminative approach\nbased on a deep-network architecture to obtain our segmentation proposals.\nMost closely related to our approach, Multibox [6, 30] proposed to train a ConvNet model to generate\nbounding box object proposals. Their approach, similar to ours, generates a set of ranked\nclass-agnostic proposals. However, our model generates segmentation proposals instead of the less\ninformative bounding box proposals. Moreover, the model architectures, training scheme, etc., are\nquite different between our approach and [30]. 
More recently, Deepbox [19] proposed a ConvNet\nmodel that learns to rerank proposals generated by EdgeBox, a bottom-up method for bounding box\nproposals. This system shares some similarities with our scoring network. Our model, however, is\nable to generate the proposals and rank them in one shot from the test image, directly from the pixel\nspace. Finally, concurrently with this work, Ren et al. [25] proposed \u2018region proposal networks\u2019 for\ngenerating box proposals that share similarities with our work. We emphasize, however, that unlike\nall these approaches our method generates segmentation masks instead of bounding boxes.\n\nFigure 1: (Top) Model architecture: the network is split into two branches after the shared feature\nextraction layers. The top branch predicts a segmentation mask for the object located at the\ncenter while the bottom branch predicts an object score for the input patch. (Bottom) Examples\nof training triplets: input patch x, mask m and label y. Green patches contain objects that satisfy\nthe speci\ufb01ed constraints and therefore are assigned the label y = 1. Note that masks for negative\nexamples (shown in red) are not used and are shown for illustrative purposes only.\n\n3 DeepMask Proposals\n\nOur object proposal method predicts a segmentation mask given an input patch, and assigns a score\ncorresponding to how likely the patch is to contain an object.\nBoth mask and score predictions are achieved with a single convolutional network. ConvNets are\n\ufb02exible models which can be applied to various computer vision tasks and they alleviate the need\nfor manually designed features. Their \ufb02exible nature allows us to design a model in which the two\ntasks (mask and score predictions) can share most of the layers of the network. Only the last layers\nare task-speci\ufb01c (see Figure 1). During training, the two tasks are learned jointly. 
Compared to a\nmodel which would have two distinct networks for the two tasks, this architecture choice reduces\nthe capacity of the model and increases the speed of full scene inference at test time.\nEach sample k in the training set is a triplet containing (1) the RGB input patch x_k, (2) the binary\nmask corresponding to the input patch m_k (with m_k^{ij} \u2208 {\u00b11}, where (i, j) corresponds to a pixel\nlocation on the input patch) and (3) a label y_k \u2208 {\u00b11} which speci\ufb01es whether the patch contains\nan object. Speci\ufb01cally, a patch x_k is given label y_k = 1 if it satis\ufb01es the following constraints:\n\n(i) the patch contains an object roughly centered in the input patch\n(ii) the object is fully contained in the patch and in a given scale range\n\nOtherwise, y_k = \u22121, even if an object is partially present. The positional and scale tolerance used in\nour experiments are given shortly. Assuming y_k = 1, the ground truth mask m_k has positive values\nonly for the pixels that are part of the single object located in the center of the patch. If y_k = \u22121 the\nmask is not used. Figure 1, bottom, shows examples of training triplets.\nFigure 1, top, illustrates an overall view of our model, which we call DeepMask. The top branch is\nresponsible for predicting a high quality object segmentation mask and the bottom branch predicts\nthe likelihood that an object is present and satis\ufb01es the above two constraints. We next describe in\ndetail each part of the architecture, the training procedure, and the fast inference procedure.\n\n3.1 Network Architecture\n\nThe parameters for the layers shared between the mask prediction and the object score prediction are\ninitialized with a network that was pre-trained to perform classi\ufb01cation on the ImageNet dataset [5].\nThis model is then \ufb01ne-tuned for generating object proposals during training. 
We choose the VGG-A\narchitecture [27] which consists of eight 3 \u00d7 3 convolutional layers (followed by ReLU non-linearities)\nand \ufb01ve 2 \u00d7 2 max-pooling layers and has shown excellent performance.\nAs we are interested in inferring segmentation masks, the spatial information provided in the convolutional\nfeature maps is important. We therefore remove all the \ufb01nal fully connected layers of the\nVGG-A model. We also discard the last max-pooling layer. The output of the shared\nlayers has a downsampling factor of 16 due to the remaining four 2 \u00d7 2 max-pooling layers; given\nan input image of dimension 3 \u00d7 h \u00d7 w, the output is a feature map of dimensions 512 \u00d7 h/16 \u00d7 w/16.\nSegmentation: The branch of the network dedicated to segmentation is composed of a single 1 \u00d7 1\nconvolution layer (and ReLU non-linearity) followed by a classi\ufb01cation layer. The classi\ufb01cation\nlayer consists of h\u00d7w pixel classi\ufb01ers, each responsible for indicating whether a given pixel belongs\nto the object in the center of the patch. Note that each pixel classi\ufb01er in the output plane must be\nable to utilize information contained in the entire feature map, and thus have a complete view of the\nobject. This is critical because unlike in semantic segmentation, our network must output a mask for\na single object even when multiple objects are present (e.g., see the elephants in Fig. 1).\nFor the classi\ufb01cation layer one could use either locally or fully connected pixel classi\ufb01ers. 
Both\noptions have drawbacks: in the former each classi\ufb01er has only a partial view of the object while\nin the latter the classi\ufb01ers have a massive number of redundant parameters. Instead, we opt to\ndecompose the classi\ufb01cation layer into two linear layers with no non-linearity in between. This\ncan be viewed as a \u2018low-rank\u2019 variant of using fully connected linear classi\ufb01ers. Such an approach\nmassively reduces the number of network parameters while allowing each pixel classi\ufb01er to leverage\ninformation from the entire feature map. Its effectiveness is shown in the experiments. Finally, to\nfurther reduce model capacity, we set the output of the classi\ufb01cation layer to be h^o \u00d7 w^o with h^o < h\nand w^o < w and upsample the output to h \u00d7 w to match the input dimensions.\nScoring: The second branch of the network is dedicated to predicting if an image patch satis\ufb01es\nconstraints (i) and (ii): that is if an object is centered in the patch and at the appropriate scale. It is\ncomposed of a 2 \u00d7 2 max-pooling layer, followed by two fully connected (plus ReLU non-linearity)\nlayers. The \ufb01nal output is a single \u2018objectness\u2019 score indicating the presence of an object in the\ncenter of the input patch (and at the appropriate scale).\n\n3.2 Joint Learning\n\nGiven an input patch x_k \u2208 I, the model is trained to jointly infer a pixel-wise segmentation mask and\nan object score. The loss function is a sum of binary logistic regression losses, one for each location\nof the segmentation network and one for the object score, over all training triplets (x_k, m_k, y_k):\n\nL(\\theta) = \\sum_k \\left( \\frac{1 + y_k}{2 w^o h^o} \\sum_{ij} \\log\\left(1 + e^{-m_k^{ij} f_{segm}^{ij}(x_k)}\\right) + \\lambda \\log\\left(1 + e^{-y_k f_{score}(x_k)}\\right) \\right) \\qquad (1)\n\nHere \u03b8 is the set of parameters, f_{segm}^{ij}(x_k) is the prediction of the segmentation network at location\n(i, j), and f_{score}(x_k) is the predicted object score. 
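As a reading aid, the per-sample term of Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Torch7 code: the names softplus and deepmask_loss are hypothetical, logits and masks are passed as nested lists, and the default weight lam = 1/32 is the value the paper reports for the score term.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def deepmask_loss(seg_logits, mask, score_logit, label, lam=1.0 / 32):
    # Hypothetical helper: one summand of Equation 1 for a single triplet.
    # seg_logits and mask are ho x wo nested lists; mask entries and label
    # take values +1 or -1.
    ho, wo = len(seg_logits), len(seg_logits[0])
    seg_sum = sum(
        softplus(-m * f)
        for f_row, m_row in zip(seg_logits, mask)
        for f, m in zip(f_row, m_row)
    )
    # The factor (1 + label) / 2 is 1 for positives and 0 for negatives,
    # so negative patches contribute nothing to the segmentation term.
    seg_term = (1 + label) / (2.0 * wo * ho) * seg_sum
    score_term = lam * softplus(-label * score_logit)
    return seg_term + score_term
```

With all logits at zero, a positive patch gives log 2 for the averaged mask term plus log 2 / 32 for the score term, while a negative patch gives only the score term.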
We alternate between backpropagating through\nthe segmentation branch and scoring branch (and set \u03bb = 1/32). For the scoring branch, the data is\nsampled such that the model is trained with an equal number of positive and negative samples.\nNote that the factor multiplying the \ufb01rst term of Equation 1 implies that we only backpropagate the\nerror over the segmentation branch if y_k = 1. An alternative would be to train the segmentation\nbranch using negatives as well (setting m_k^{ij} = 0 for all pixels if y_k = 0). However, we found that\ntraining with positives only was critical for generalizing beyond the object categories seen during\ntraining and for achieving high object recall. This way, during inference the network attempts to\ngenerate a segmentation mask at every patch, even if no known object is present.\n\n3.3 Full Scene Inference\n\nDuring full image inference, we apply the model densely at multiple locations and scales. This\nis necessary so that for each object in the image we test at least one patch that fully contains the\nobject (roughly centered and at the appropriate scale), satisfying the two assumptions made during\ntraining. This procedure gives a segmentation mask and object score at each image location. Figure 2\nillustrates the segmentation output when the model is applied densely to an image at a single scale.\nThe full image inference procedure is ef\ufb01cient since all computations can be performed convolutionally.\nThe VGG features can be computed densely in a fraction of a second given a typical input\nimage. For the segmentation branch, the last fully connected layer can be computed via convolutions\napplied to the VGG features. The scores are likewise computed by convolutions on the VGG\nfeatures followed by two 1 \u00d7 1 convolutional layers. 
Exact runtimes are given in \u00a74.\n\nFigure 2: Output of segmentation model applied densely to a full image with a 16 pixel stride (at a\nsingle scale at the central horizontal image region). Multiple locations give rise to good masks for\neach of the three monkeys (scores not shown). Note that no monkeys appeared in our training set.\n\nFinally, note that the scoring branch of the network has a downsampling factor 2\u00d7 larger than the\nsegmentation branch due to the additional max-pooling layer. Given an input test image of size\nh^t \u00d7 w^t, the segmentation and object network generate outputs of dimension h^t/16 \u00d7 w^t/16 and h^t/32 \u00d7 w^t/32,\nrespectively. In order to achieve a one-to-one mapping between the mask prediction and object\nscore, we apply the interleaving trick right before the last max-pooling layer for the scoring branch\nto double its output resolution (we use exactly the implementation described in [26]).\n\n3.4 Implementation Details\n\nDuring training, an input patch x_k is considered to contain a \u2018canonical\u2019 positive example if an object\nis precisely centered in the patch and has maximal dimension equal to exactly 128 pixels. However,\nhaving some tolerance in the position of an object within a patch is critical as during full image\ninference most objects will be observed slightly offset from their canonical position. Therefore,\nduring training, we randomly jitter each \u2018canonical\u2019 positive example to increase the robustness of\nour model. Speci\ufb01cally, we consider translation shift (of \u00b116 pixels), scale deformation (of 2^{\u00b11/4}),\nand also horizontal \ufb02ip. In all cases we apply the same transformation to both the image patch x_k\nand the ground truth mask m_k and assign the example a positive label y_k = 1. 
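The jitter ranges above can be illustrated with a minimal Python sketch. The text gives only the ranges, so several things here are assumptions: the hypothetical helper sample_jitter draws the shift and the scale exponent uniformly, treats the two axes independently, and flips with probability one half.

```python
import random


def sample_jitter(rng):
    # Hypothetical helper: draw one jitter for a canonical positive example.
    # Uniform sampling and a 50 percent flip rate are assumptions; the paper
    # states only the ranges.
    dx = rng.uniform(-16.0, 16.0)             # horizontal shift in pixels
    dy = rng.uniform(-16.0, 16.0)             # vertical shift in pixels
    scale = 2.0 ** rng.uniform(-0.25, 0.25)   # scale factor in [2^(-1/4), 2^(1/4)]
    flip = rng.random() < 0.5                 # horizontal flip
    return dx, dy, scale, flip


rng = random.Random(0)
print(sample_jitter(rng))
```

Whatever transformation is drawn would be applied identically to the image patch and to its ground truth mask, as the text requires.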
Negative examples\n(y_k = \u22121) are any patches at least \u00b132 pixels or 2^{\u00b11} in scale from any canonical positive example.\nDuring full image inference we apply the model densely at multiple locations (with a stride of 16\npixels) and scales (scales 2^{\u22122} to 2^{1} with a step of 2^{1/2}). This ensures that there is at least one tested\nimage patch that fully contains each object in the image (within the tolerances used during training).\nAs in the original VGG-A network [27], our model is fed with RGB input patches of dimension\n3 \u00d7 224 \u00d7 224. Since we removed the \ufb01fth pooling layer, the common branch outputs a feature map\nof dimensions 512 \u00d7 14 \u00d7 14. The score branch of our network is composed of 2 \u00d7 2 max pooling\nfollowed by two fully connected layers (with 512 and 1024 hidden units, respectively). Both of these\nlayers are followed by ReLU non-linearity and a dropout [28] procedure with a rate of 0.5. A \ufb01nal\nlinear layer then generates the object score.\nThe segmentation branch begins with a single 1 \u00d7 1 convolutional layer with 512 units. This feature\nmap is then fully connected to a low dimensional output of size 512, which is further fully connected\nto each pixel classi\ufb01er to generate an output of dimension 56 \u00d7 56. As discussed, there is no non-linearity\nbetween these two layers. In total, our model contains around 75M parameters.\nA \ufb01nal bilinear upsampling layer is added to transform the 56 \u00d7 56 output prediction to the full\n224 \u00d7 224 resolution of the ground-truth (directly predicting the full resolution output would have\nbeen much slower). We opted for a non-trainable layer as we observed that a trainable one simply\nlearned to bilinearly upsample. 
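To make the saving from the low-rank segmentation head concrete, the weight counts implied by the dimensions in this section can be worked out directly. The sketch below is arithmetic only: it counts weights and ignores biases (an assumption), and linear_params is a hypothetical helper name.

```python
def linear_params(n_in, n_out):
    # Weights of a fully connected layer, biases ignored (assumption).
    return n_in * n_out


features = 512 * 14 * 14   # flattened 512 x 14 x 14 feature map
outputs = 56 * 56          # one classifier per output pixel
rank = 512                 # size of the intermediate low dimensional layer

low_rank = linear_params(features, rank) + linear_params(rank, outputs)
full_rank = linear_params(features, outputs)

print(low_rank)    # 52985856
print(full_rank)   # 314703872
```

Under these assumptions the two-layer decomposition needs roughly 53M weights against roughly 315M for direct connections, which is in line with the figure of over 300M parameters the paper gives for its full-rank variant.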
Alternatively, we tried downsampling the ground-truth instead of\nupsampling the network output; however, we found that doing so slightly reduced accuracy.\nThe architecture design and hyper-parameters were chosen using a subset of the MS COCO validation\ndata [21] (non-overlapping with the data we used for evaluation). We used a learning rate\nof .001. We trained our model using stochastic gradient descent with a batch size of 32 examples,\nmomentum of .9, and weight decay of .00005. Aside from the pre-trained VGG features, weights\nare initialized randomly from a uniform distribution. Our model takes around 5 days to train on an\nNvidia Tesla K40m. To binarize predicted masks we simply threshold the continuous output (using\na threshold of .1 for PASCAL and .2 for COCO). All the experiments were conducted using Torch7 (http://torch.ch).\n\nFigure 3: DeepMask proposals with highest IoU to the ground truth on selected images from COCO.\nMissed objects (no matching proposals with IoU > 0.5) are marked with a red outline.\n\n4 Experimental Results\n\nIn this section, we evaluate the performance of our approach on the PASCAL VOC 2007 test set [7]\nand on the \ufb01rst 5000 images of the MS COCO 2014 validation set [21]. Our model is trained on\nthe COCO training set which contains about 80,000 images and a total of nearly 500,000 segmented\nobjects. Although our model is trained to generate segmentation proposals, it can also be used to\nprovide box proposals by taking the bounding boxes enclosing the segmentation masks. Figure 3\nshows examples of generated proposals with highest IoU to the ground truth on COCO.\nMetrics: We measure accuracy using the common Intersection over Union (IoU) metric. IoU is the\narea of the intersection of a candidate proposal and ground-truth annotation divided by the area of their union.\nThis metric can be applied to both segmentation and box proposals. Following Hosang et al. 
[13],\nwe evaluate the performance of the proposal methods considering the average recall (AR) between\nIoU 0.5 and 1.0 for a \ufb01xed number of proposals. AR has been shown to correlate extremely well\nwith detector performance (recall at a single IoU threshold is far less predictive) [13].\nMethods: We compare to the current top-\ufb01ve publicly-available proposal methods including: Edge-\nBoxes [34], SelectiveSearch [31], Geodesic [16], Rigor [14], and MCG [24]. These methods achieve\ntop results on object detection (when coupled with R-CNNs [10]) and also obtain the best AR [13].\nResults: Figure 4 (a-c) compares the performance of our approach, DeepMask, to existing proposal\nmethods on PASCAL (using boxes) and COCO (using both boxes and segmentations). Shown is the\nAR of each method as a function of the number of generated proposals. Under all scenarios Deep-\nMask (and its variants) achieves substantially better AR for all numbers of proposals considered. AR\nat selected proposal counts and averaged across all counts (AUC) is reported in Tables 1 and 2 for\nCOCO and PASCAL, respectively. Notably, DeepMask achieves an order of magnitude reduction\nin the number of proposals necessary to reach a given AR under most scenarios. For example, with\n100 segmentation proposals DeepMask achieves an AR of .245 on COCO while competing methods\nrequire nearly 1000 segmentation proposals to achieve similar AR.\n\n6\n\n\f(a) Box proposals on PASCAL.\n\n(b) Box proposals on COCO.\n\n(c) Segm. 
proposals on COCO.\n\n(d) Small objects (area < 32^2).\n\n(e) Medium objects.\n\n(f) Large objects (area > 96^2).\n\n(g) Recall with 10 proposals.\n\n(h) Recall with 100 proposals.\n\n(i) Recall with 1000 proposals.\n\nFigure 4: (a-c) Average recall versus number of box and segmentation proposals on various datasets.\n(d-f) AR versus number of proposals for different object scales on segmentation proposals in COCO.\n(g-i) Recall versus IoU threshold for different number of segmentation proposals in COCO.\n\nBox Proposals\nMethod | AR@10 | AR@100 | AR@1000 | AUC\nEdgeBoxes [34] | .074 | .178 | .338 | .139\nGeodesic [16] | .040 | .180 | .359 | .126\nRigor [14] | - | .133 | .337 | .101\nSelectiveSearch [31] | .052 | .163 | .357 | .126\nMCG [24] | .101 | .246 | .398 | .180\nDeepMask20 | .139 | .286 | .431 | .217\nDeepMask20\u2217 | .152 | .306 | .432 | .228\nDeepMaskZoom | .150 | .326 | .482 | .242\nDeepMaskFull | .149 | .310 | .442 | .231\nDeepMask | .153 | .313 | .446 | .233\n\nSegmentation Proposals\nMethod | AR@10 | AR@100 | AR@1000 | AUC^S | AUC^M | AUC^L | AUC\nEdgeBoxes [34] | - | - | - | - | - | - | -\nGeodesic [16] | .023 | .123 | .253 | .013 | .086 | .205 | .085\nRigor [14] | - | .094 | .253 | .022 | .060 | .178 | .074\nSelectiveSearch [31] | .025 | .095 | .230 | .006 | .055 | .214 | .074\nMCG [24] | .077 | .186 | .299 | .031 | .129 | .324 | .137\nDeepMask20 | .109 | .215 | .314 | .020 | .227 | .317 | .164\nDeepMask20\u2217 | .123 | .233 | .314 | .020 | .257 | .321 | .175\nDeepMaskZoom | .127 | .261 | .366 | .068 | .263 | .308 | .194\nDeepMaskFull | .118 | .235 | .323 | .020 | .244 | .342 | .176\nDeepMask | .126 | .245 | .331 | .023 | .266 | .336 | .183\n\nTable 1: Results on the MS COCO dataset for both bounding box and segmentation proposals. We\nreport AR at different number of proposals (10, 100 and 1000) and also AUC (AR averaged across\nall proposal counts). For segmentation proposals we report overall AUC and also AUC at different\nscales (small/medium/large objects indicated by superscripts S/M/L). See text for details.\n\nScale: The COCO dataset contains objects in a wide range of scales. 
In order to analyze performance\nin more detail, we divided the objects in the validation set into roughly equally sized sets according\nto object pixel area a: small (a < 32^2), medium (32^2 \u2264 a \u2264 96^2), and large (a > 96^2) objects.\nFigure 4 (d-f) shows performance at each scale; all models perform poorly on small objects. To\nimprove accuracy of DeepMask we apply it at an additional smaller scale (DeepMaskZoom). This\nboosts performance (especially for small objects) but at a cost of increased inference time.\n\nPASCAL VOC07 | AR@10 | AR@100 | AR@1000 | AUC\nEdgeBoxes [34] | .203 | .407 | .601 | .309\nGeodesic [16] | .121 | .364 | .596 | .230\nRigor [14] | .164 | .321 | .589 | .239\nSelectiveSearch [31] | .085 | .347 | .618 | .241\nMCG [24] | .232 | .462 | .634 | .344\nDeepMask | .337 | .561 | .690 | .433\n\nTable 2: Results on PASCAL VOC 2007 test.\n\nFigure 5: Fast R-CNN results on PASCAL.\n\nLocalization: Figure 4 (g-i) shows the recall each model achieves as the IoU varies, shown for\ndifferent number of proposals per image. 
DeepMask achieves a higher recall in virtually every\nscenario, except at very high IoU, in which it falls slightly below other models. This is likely due\nto the fact that our method outputs a downsampled version of the mask at each location and scale; a\nmultiscale approach or skip connections could improve localization at very high IoU.\nGeneralization: To see if our approach can generalize to unseen classes [2, 19], we train two ad-\nditional versions of our model, DeepMask20 and DeepMask20\u2217. DeepMask20 is trained only with\nobjects belonging to one of the 20 PASCAL categories (subset of the full 80 COCO categories).\nDeepMask20\u2217 is similar, except we use the scoring network from the original DeepMask. Results\nfor the two models when evaluated on all 80 COCO categories (as in all other experiments) are\nshown in Table 1. Compared to DeepMask, DeepMask20 exhibits a drop in AR (but still outper-\nforms all previous methods). DeepMask20\u2217, however, matches the performance of DeepMask. This\nsurprising result demonstrates that the drop in accuracy is due to the discriminatively trained scoring\nbranch (DeepMask20 is inadvertently trained to assign low scores to the other 60 categories); the\nsegmentation branch generalizes extremely well even when trained on a reduced set of categories.\nArchitecture: In the segmentation branch, the convolutional features are fully connected to a 512\n\u2018low-rank\u2019 layer which is in turn connected to the 56\u00d756 output (with no intermediate non-linearity),\nsee \u00a73. We also experimented with a \u2018full-rank\u2019 architecture (DeepMaskFull) with over 300M pa-\nrameters where each of the 56\u00d7 56 outputs was directly connected to the convolutional features. As\ncan be seen in Table 1, DeepMaskFull is slightly inferior to our \ufb01nal model (and much slower).\nDetection: As a \ufb01nal validation, we evaluate how DeepMask performs when coupled with an object\ndetector on PASCAL VOC 2007 test. 
We re-train and evaluate the state-of-the-art Fast R-CNN [9]\nusing proposals generated by SelectiveSearch [31] and our method. Figure 5 shows the mean average\nprecision (mAP) for Fast R-CNN with a varying number of proposals. Most notably, with just 100\nDeepMask proposals Fast R-CNN achieves mAP of 68.2% and outperforms the best results obtained\nwith 2000 SelectiveSearch proposals (mAP of 66.9%). We emphasize that with 20\u00d7 fewer proposals\nDeepMask outperforms SelectiveSearch (this is consistent with the AR numbers in Table 1). With\n500 DeepMask proposals, Fast R-CNN improves to 69.9% mAP, after which performance begins to\ndegrade (a similar effect was observed in [9]).\nSpeed: Inference takes an average of 1.6s per image on the COCO dataset (1.2s on the smaller\nPASCAL images). Our runtime is competitive with the fastest segmentation proposal methods\n(Geodesic [16] runs at \u223c1s per PASCAL image) and substantially faster than most (e.g., MCG [24]\ntakes \u223c30s). Inference time can be further reduced by \u223c30% by parallelizing all scales in a single\nbatch (eliminating GPU overhead). We do, however, require a GPU for efficient inference.\n\n5 Conclusion\n\nIn this paper, we propose an innovative framework to generate segmentation object proposals\ndirectly from image pixels. At test time, the model is applied densely over the entire image at multiple\nscales and generates a set of ranked segmentation proposals. We show that learning features for\nobject proposal generation is not only feasible but effective. Our approach surpasses the previous\nstate of the art by a large margin in both box and segmentation proposal generation. 
In future work,\nwe plan on coupling our proposal method more closely with state-of-the-art detection approaches.\n\nAcknowledgements: We would like to thank Ahmad Humayun and Tsung-Yi Lin for help with\ngenerating experimental results, Andrew Tulloch, Omry Yadan and Alexey Spiridonov for help with computational\ninfrastructure, and Rob Fergus, Yuandong Tian and Soumith Chintala for valuable discussions.\n\nReferences\n[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 2012. 2\n[2] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra. Object-proposal evaluation protocol is \u2018gameable\u2019. arXiv:1505.05836, 2015. 8\n[3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. ICLR, 2015. 2\n[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 1\n[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 1, 3\n[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014. 2\n[7] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010. 1, 2, 6\n[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010. 1\n[9] R. Girshick. Fast R-CNN. arXiv:1504.08083, 2015. 2, 8\n[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 2, 6\n[11] B. Hariharan, P. Arbel\u00e1ez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015. 2\n[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 2\n[13] J. Hosang, R. Benenson, P. Doll\u00e1r, and B. Schiele. What makes for effective detection proposals? arXiv:1502.05082, 2015. 2, 6\n[14] A. Humayun, F. Li, and J. M. Rehg. RIGOR: Reusing Inference in Graph Cuts for generating Object Regions. In CVPR, 2014. 2, 6, 7, 8\n[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 1\n[16] P. Kr\u00e4henb\u00fchl and V. Koltun. Geodesic object proposals. In ECCV, 2014. 2, 6, 7, 8\n[17] P. Kr\u00e4henb\u00fchl and V. Koltun. Learning to propose objects. In CVPR, 2015. 2\n[18] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 2\n[19] W. Kuo, B. Hariharan, and J. Malik. Deepbox: Learning objectness with convolutional networks. arXiv:1505.02146v1, 2015. 2, 8\n[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 1\n[21] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Doll\u00e1r. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2015. 1, 2, 5, 6\n[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? \u2013 Weakly-supervised learning with convolutional neural networks. In CVPR, 2015. 2\n[23] P. O. Pinheiro and R. Collobert. Recurrent conv. neural networks for scene labeling. In ICML, 2014. 2\n[24] J. Pont-Tuset, P. Arbel\u00e1ez, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. arXiv:1503.00848, 2015. 2, 6, 7, 8\n[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv:1506.01497, 2015. 2\n[26] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. 5\n[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 2, 3, 5\n[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014. 5\n[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. 2\n[30] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. arXiv:1412.1441, 2014. 2\n[31] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recog. IJCV, 2013. 2, 6, 7, 8\n[32] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004. 1\n[33] Z. Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segdeepm: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015. 1\n[34] C. L. Zitnick and P. Doll\u00e1r. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 2, 6, 7, 8\n", "award": [], "sourceid": 1207, "authors": [{"given_name": "Pedro", "family_name": "O. Pinheiro", "institution": "EPFL / Idiap"}, {"given_name": "Ronan", "family_name": "Collobert", "institution": "Facebook"}, {"given_name": "Piotr", "family_name": "Dollar", "institution": "Facebook AI Research"}]}