{"title": "Segmenting Scenes by Matching Image Composites", "book": "Advances in Neural Information Processing Systems", "page_first": 1580, "page_last": 1588, "abstract": "In this paper, we investigate how similar images sharing the same global description can help with unsupervised scene segmentation in an image.  In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.  This allows for a better explanation of the input scenes.  We perform MRF-based segmentation that optimizes over matches, while respecting boundary information.  The recovered segments are then used to re-query a large database of images to retrieve better matches for the target region. We show improved performance in detecting occluding boundaries over previous methods on data gathered from the LabelMe database.", "full_text": "Segmenting Scenes by Matching Image Composites\n\nBryan C. Russell1 Alexei A. Efros2,1 Josef Sivic1 William T. Freeman3 Andrew Zisserman4,1\n\n\u2217\n1INRIA\n\n2Carnegie Mellon University\n\n3CSAIL MIT\n\n4University of Oxford\n\nAbstract\n\nIn this paper, we investigate how, given an image, similar images sharing the same\nglobal description can help with unsupervised scene segmentation. In contrast\nto recent work in semantic alignment of scenes, we allow an input image to be\nexplained by partial matches of similar scenes. This allows for a better explanation\nof the input scenes. We perform MRF-based segmentation that optimizes over\nmatches, while respecting boundary information. The recovered segments are then\nused to re-query a large database of images to retrieve better matches for the target\nregions. 
We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.\n\n1 Introduction\n\nSegmenting semantic objects, and more broadly image parsing, is a fundamentally challenging problem. The task is painfully under-constrained – given a single image, it is extremely difficult to partition it into semantically meaningful elements, not just blobs of similar color or texture. For example, how would the algorithm figure out that doors and windows on a building, which look quite different, belong to the same segment? Or that the grey pavement and a grey house next to it are different segments? Clearly, information beyond the image itself is required to solve this problem.\n\nIn this paper, we argue that some of this extra information can be extracted by also considering images that are visually similar to the given one. With the increasing availability of Internet-scale image collections (in the millions of images!), this idea of data-driven scene matching has recently shown much promise for a variety of tasks. Simply by finding matching images using a low-dimensional descriptor and transferring any associated labels onto the input image, impressive results have been demonstrated for object and scene recognition [22], object detection [18, 11], image geo-location [7], and particular object and event annotation [15], among others. Even if the image collection does not contain any labels, it has been shown to help tasks such as image completion and exploration [6, 21], image colorization [22], and 3D surface layout estimation [5].\n\nHowever, as noted by several authors and illustrated in Figure 1, the major stumbling block of all the scene-matching approaches is that, despite the large quantities of data, for many types of images the quality of the matches is still not very good. 
Part of the reason is that the low-level image descriptors used for matching are just not powerful enough to capture some of the more semantic similarity. Several approaches have been proposed to address this shortcoming, including synthetically increasing the dataset with transformed copies of images [22], cleaning matching results using clustering [18, 7, 5], automatically prefiltering the dataset [21], or simply picking good matches by hand [6]. All these approaches improve performance somewhat but do not alleviate the issue entirely. We believe that there is a more fundamental problem – the variability of the visual world is just so vast, with an exponential number of different object combinations within each scene, that it might be futile to expect to always find a single overall good match at all!\n\n\u2217WILLOW project-team, Laboratoire d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548\n\nFigure 1: Illustration of the scene matching problem. Left: Input image (along with the output segmentation given by our system overlaid) to be matched to a dataset of 100k street images. Notice that the output segment boundaries align well with the depicted objects in the scene. Top right: top three retrieved images, based on matching the gist descriptor [14] over the entire image. The matches are not good. Bottom right: Searching for matches within each estimated segment (using the same gist representation within the segment) and compositing the results yields much better matches to the input image.\n\nInstead, we argue that an input image should be explained by a spatial composite of different regions taken from different database images. 
The aim is to break up the image into chunks that are small enough to have good matches within the database, but still large enough that the matches retain their informative power.\n\n1.1 Overview\n\nIn this work, we propose to apply scene matching to the problem of segmenting out semantically meaningful objects (i.e. we seek to segment objects enclosed by the principal occlusion and contact boundaries and not objects that are part-of or attached to other objects). The idea is to turn to our advantage the fact that scene matches are never perfect. What typically happens during scene matching is that some part of the image is matched quite well, while other parts are matched only approximately, at a very coarse level. For example, for a street scene, one matching image could have a building match very well, but get the shape of the road wrong, while another matching image could get the road exactly right, but have a tree instead of a building. These differences in matching provide a powerful signal to identify objects and segmentation boundaries. By computing a matching image composite, we should be able to better explain the input image (i.e. match each region in the input image to semantically similar regions in other images) than if we used a single best match.\n\nThe starting point of our algorithm is an input image and an "image stack" – a set of coarsely matching images (5000 in our case) retrieved from a large dataset using a standard image matching technique (gist [14] in our case). In essence, the image stack is itself a dataset, but tailor-made to match the overall scene structure for the particular input image. Intuitively, our goal is to use the image stack to segment (and "explain") the input image in a semantically meaningful way. 
The idea is that, since the stack is already more-or-less aligned, the regions corresponding to the semantic objects that are present in many images will consistently appear in the same spatial location. The input image can then be explained as a patchwork of these consistent regions, simultaneously producing a segmentation, as well as composite matches, that are better than any of the individual matches within the stack.\n\nThere has been prior work on producing a composite image from a stack of aligned images depicting the same scene, in particular the PhotoMontage work [1], which optimally selects regions from the globally aligned images based on a quality score to composite a visually pleasing output image. Recently, there has been work based on the PhotoMontage framework that tries to automatically align images depicting the same scene or objects to perform segmentation [16], region-filling [23], and outlier detection [10]. In contrast, in this work, we are attempting to work on a stack of visually similar, but physically different, scenes. This is in the same spirit as the contemporary work of [11], except they work on supervised data, whereas we are completely unsupervised. Also related is the contemporary work of [9].\n\nOur approach combines boundary-based and region-based segmentation processes together within a single MRF framework. The boundary process (Section 2) uses the stack to determine the likely semantic boundaries between objects. The region process (Section 3) aims to group pixels belonging to the same object across the stack. These cues are combined together within an MRF framework which is solved using GraphCut optimization (Section 4). We present results in Section 5.\n\n2 Boundary process: data driven boundary detection\n\nInformation from only a single image is in many cases not sufficient for recovering boundaries between objects. 
Strong image edges could correspond to internal object structures, such as a window\nor a wheel of a car. Additionally, boundaries between objects often produce weak image evidence,\nas for example the boundary between a building and road of similar color partially occluding each\nother.\n\nHere, we propose to analyze the statistics of a large number of related images (the stack) to help\nrecover boundaries between objects. We will exploit the fact that objects tend not to rest at exactly\nthe same location relative to each other in a scene. For example, in a street scene, a car may be\nadjacent to regions belonging to a number of objects, such as building, person, road, etc. On the\nother hand, relative positions of internal object structures will be consistent across many images. For\nexample, wheels and windows on a car will appear consistently at roughly similar positions across\nmany images.\n\nTo recover object boundaries, we will measure the ability to consistently match locally to the same\nset of images in the stack. Intuitively, regions inside an object will tend to match to the same set of\nimages, each having similar appearance, while regions on opposite sides of a boundary will match to\ndifferent sets of images. More formally, given an oriented line passing through an image point p at\norientation \u03b8, we wish to analyze the statistics of two sets of images with similar appearance on each\nside of the line. For each side of the oriented line, we independently query the stack of images by\nforming a local image descriptor modulated by a weighted mask. We use a half-Gaussian weighting\nmask oriented along the line and centered at image point p. This local mask modulates the Gabor\n\ufb01lter responses (8 orientations over 4 scales) and the RGB color channels, with a descriptor formed\nby averaging the Gabor energy and color over 32\u00d732 pixel spatial bins. 
The Gaussian modulated descriptor g(p, θ) captures the appearance information on one side of the boundary at point p and orientation θ. Appearance descriptors extracted in the same manner across the image stack are compared with the query image descriptor using the L1 distance. Images in the stack are assumed to be coarsely aligned, and hence matches are considered only at the particular query location p and orientation θ across the stack, i.e. matching is not translation invariant. We believe this type of spatially dependent matching is suitable for scene images with the consistent spatial layout considered in this work. The quality of the matches can be further improved by finely aligning the stack images with the query [12].\n\nFor each image point p and orientation θ, the output of the local matching on the two sides of the oriented line are two ranked lists of image stack indices, S_r and S_l, where the ordering of each list is given by the L1 distance between the local descriptors g(p, θ) of the query image and each image in the stack. We compute Spearman's rank correlation coefficient between the two rank-ordered lists\n\nρ(p, θ) = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1)),  (1)\n\nwhere n is the number of images in the stack and d_i is the difference between the ranks of stack image i in the two ranked lists, S_r and S_l. A high rank correlation should indicate that point p lies inside an object's extent, whereas a low correlation should indicate that point p is at an object boundary with orientation θ. We note, however, that low rank correlations could also be caused by poor quality of local matches. 
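As an illustration, the rank correlation of Equation (1) can be computed directly from the two ranked lists of stack indices. The following is a minimal sketch (the function name is ours, not the authors'), assuming each list is a permutation of the same n stack-image indices, best match first:

```python
def spearman_rho(ranked_left, ranked_right):
    # Spearman's rank correlation (Eq. 1): rho = 1 - 6*sum(d_i^2) / (n(n^2 - 1)).
    # ranked_left / ranked_right: the rank-ordered lists S_l and S_r of the
    # same n stack-image indices, ordered by L1 descriptor distance.
    n = len(ranked_left)
    rank_l = {img: r for r, img in enumerate(ranked_left)}
    rank_r = {img: r for r, img in enumerate(ranked_right)}
    d2 = sum((rank_l[img] - rank_r[img]) ** 2 for img in ranked_left)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

Identical rankings on both sides of the line give ρ = 1 (the point likely lies inside an object), while fully reversed rankings give ρ = −1.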
Figure 2 illustrates the boundary detection process.\n\nFor efficiency reasons, we only compute the rank correlation score along points and orientations marked as boundaries by the probability of boundary edge detector (PB) [13], with boundary orientations θ ∈ [0, π) quantized in steps of π/8. The final boundary score P_DB of the proposed data driven boundary detector is a gating of the maximum PB response over all orientations, P_B, and the rank correlation coefficient ρ,\n\nP_DB(p, θ) = P_B(p, θ) · ((1 − ρ(p, θ)) / 2) · δ[P_B(p, θ) = max_θ̄ P_B(p, θ̄)].  (2)\n\nFigure 2: Data driven boundary detection. Left: Input image with query edges shown. Right: The top 9 matches in a large collection of images for each side of the query edges. Rank correlation for occlusion boundary (A): -0.0998; rank correlation within the road region (B): 0.6067. Notice that for point B lying inside an object (the road), the ranked sets of retrieved images for the two sides of the oriented line are similar, resulting in a high rank correlation score. At point A lying at an occlusion boundary between the building and the sky, the sets of retrieved images are very different, resulting in a low rank correlation score.\n\nNote that this type of data driven boundary detection is very different from image based edge detection [4, 13] as (i) strong image edges can receive a low score provided the matched image structures on each side of the boundary co-occur in many places in the image collection, and (ii) weak image edges can receive a high score, provided the neighboring image structures on each side of the weak image boundary do not co-occur often in the database. 
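The gating in Equation (2) can be sketched in a few lines of NumPy, assuming the PB responses and rank correlations have been precomputed per pixel and quantized orientation (array shapes and names here are illustrative, not the authors' code):

```python
import numpy as np

def data_driven_boundary(pb, rho):
    # Eq. 2: P_DB(p, theta) = P_B(p, theta) * (1 - rho(p, theta)) / 2,
    # kept only at each pixel's maximal-PB orientation (the delta term).
    # pb, rho: (H, W, T) arrays over T quantized orientations.
    best = pb.argmax(axis=2)
    keep = np.zeros_like(pb)
    rows, cols = np.indices(best.shape)
    keep[rows, cols, best] = 1.0  # delta[P_B equals its max over orientations]
    return pb * (1.0 - rho) / 2.0 * keep
```

A strong PB edge with high rank correlation (ρ near 1) is suppressed, while an edge whose two sides retrieve very different image sets (ρ near −1) keeps close to its full PB strength.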
In contrast to the PB detector, which is trained from manually labelled object boundaries, data driven boundary scores are determined based on co-occurrence statistics of similar scenes and require no additional manual supervision. Figure 3 shows examples of data driven boundary detection results. Quantitative evaluation is given in section 5.\n\n3 Region process: data driven image grouping\n\nThe goal is to group pixels in a query image that are likely to belong to the same object or a major scene element (such as a building, a tree, or a road). Instead of relying on local appearance similarity, such as color or texture, we again turn to the dataset of scenes in the image stack to suggest the groupings.\n\nOur hypothesis is that regions corresponding to semantically meaningful objects would be coherent across a large part of the stack. Therefore, our goal is to find clusters within the stack that are both (i) self-consistent, and (ii) explain well the query image. Note that for now, we do not want to make any hard decisions; therefore, we want to allow multiple clusters to be able to explain overlapping parts of the query image. For example, a tree cluster and a building cluster (drawn from different parts of the stack) might be able to explain the same patch of the image, and both hypotheses should be retained. This way, the final segmentation step in the next section will be free to choose the best set of clusters based on all the information available within a global framework.\n\nTherefore our approach is to find clusters of image patches that match the same images within the stack. In other words, two patches in the query image will belong to the same group if the sets of their best matching images from the database are similar. As in the boundary process described in section 2, the query image is compared with each database image only at the particular query patch location, i.e. the matching is not translation invariant. 
Note that patches with very different appearance can be grouped together as long as they match the same database images. For example, a door and a window of a building can be grouped together despite their different shape and appearance as long as they co-occur together (and get matched) in other images. This type of matching is different from self-similarity matching [20] where image patches within the same image are grouped together if they look similar.\n\nFigure 3: Data driven boundary detection. (a) Input image. (b) Ground truth boundaries. (c) P_B [13]. (d) Proposed data driven boundary detection. Notice enhanced object boundaries and suppressed false positive boundaries inside objects.\n\nFormally, given a database of N scene images, each rectangular patch in the query image is described by an N-dimensional binary vector, y, where the i-th element y[i] is set to 1 if the i-th image in the database is among the m = 1000 nearest neighbors of the patch. Other elements of y are set to 0. The nearest neighbors for each patch are obtained by matching the local gist and color descriptors at the particular image location as described in section 2, but here center weighted by a full Gaussian mask with σ = 24 pixels.\n\nWe now wish to find cluster centers c_k for k ∈ {1, . . . , K}. Many methods exist for finding clusters in such a space. For example, one can think of the desired object clusters as "topics of an image stack" and apply one of the standard topic discovery methods like probabilistic latent semantic analysis (pLSA) [8] or Latent Dirichlet Allocation (LDA) [2]. However, we found that a simple K-means algorithm applied to the indicator vectors produced good results. Clearly, the number of clusters, K, is an important parameter. 
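The indicator-vector construction and the K-means grouping step can be sketched as follows. This is a minimal illustration under our own assumptions (a precomputed patch-to-stack distance matrix, tiny m and K), not the authors' implementation:

```python
import numpy as np

def indicator_vectors(dists, m):
    # dists: (P, N) matching distances from each of P query patches to the
    # N stack images at that patch's location. y[p, i] = 1 iff stack image i
    # is among the m nearest neighbors of patch p (m = 1000 in the paper).
    y = np.zeros(dists.shape)
    nn = np.argsort(dists, axis=1)[:, :m]
    np.put_along_axis(y, nn, 1.0, axis=1)
    return y

def kmeans_centers(y, K, iters=20, seed=0):
    # Plain K-means on the binary indicator vectors; the resulting centers
    # c_k are later compared to a patch via the soft similarity c_k^T y.
    rng = np.random.default_rng(seed)
    centers = y[rng.choice(len(y), size=K, replace=False)].copy()
    for _ in range(iters):
        labels = ((y[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = y[labels == k].mean(0)
    return centers, labels
```

Because two patches only need to retrieve overlapping sets of stack images, visually dissimilar patches (e.g. a door and a window on the same building) can end up in the same cluster.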
Because we are not trying to discover all the semantic objects within a stack, but only those that explain well the query image, we found that a relatively small number of clusters (e.g. 5) is sufficient. Figure 4 shows heat maps of the similarity (measured as c_k^T y) of each binary vector to the recovered cluster centers. Notice that regions belonging to the major scene components are highlighted. Although hard K-means clustering is applied to cluster patches at this stage, a soft similarity score for each patch under each cluster is used in a segmentation cost function incorporating both region and boundary cues, described next.\n\n4 Image segmentation combining boundary and region cues\n\nIn the preceding two sections we have developed models for estimating data-driven scene boundaries and coherent regions from the image stack. Note that while both the boundary and the region processes use the same data, they are in fact producing very different, and complementary, types of information. The region process aims to find large groups of coherent pixels that co-occur together often, but is not too concerned about precise localization. The boundary process, on the other hand, focuses rather myopically on the local image behavior around boundaries but has excellent localization.\n\nFigure 4: Data driven image grouping. Left: input image. Right: heat maps indicating groupings of pixels belonging to the same scene component, which are found by clustering image patches that match the same set of images in the stack (warmer colors correspond to higher similarity to a cluster center). Notice that regions belonging to the major scene components are highlighted. Also, local regions with different appearances (e.g. doors and windows in the interior of the building) can map to the same cluster since they only need to match to the same set of images. Finally, the highlighted regions tend to overlap, thereby providing multiple hypotheses for a local region.\n\nBoth pieces of information are needed for a successful scene segmentation and explanation. In this section, we propose to use a single MRF-based optimization framework for this task, that will negotiate between the more global region process and the well-localized boundary process. We set up a multi-state MRF on pixels for segmentation, where the states correspond to the K different image stack groups from section 3. The MRF is formulated as follows:\n\nmin_x Σ_i φ_i(x_i, y_i) + Σ_{(i,j)} ψ_{i,j}(x_i, x_j)  (3)\n\nwhere x_i ∈ {0, 1, . . . , K} is the state at pixel i corresponding to one of K different image stack groups (section 3), φ_i are unary costs defined by the similarity of a patch at pixel i, described by an indicator vector y_i (section 3), to each of the K image stack groups, and ψ_{i,j} are binary costs for a boundary-dependent Potts model (section 2). We also allow an additional outlier state x_i = 0 for regions that do not match any of the clusters well. For the pairwise term, we assume a 4-neighbourhood structure, i.e. the extent is over adjacent horizontal and vertical neighbors. The unary term in Equation 3 encourages pixels explained well by the same group of images from the stack to receive the same label. The binary term encourages neighboring pixels to have the same label, except in the case of strong boundary evidence.\n\nIn more detail, the unary term is given by\n\nφ_i(x_i = k, y_i) = { −s(c_k, y_i)  if k ∈ {1, . . . , K};  γ  if k = 0 }  (4)\n\nwhere γ is a scalar parameter, and s(c_k, y_i) = c_k^T y_i is the similarity between the indicator vector y_i describing the local image appearance at pixel i (section 3) and the k-th cluster center c_k. The pairwise term is defined as\n\nψ_{i,j}(x_i, x_j) = (α + β f(i, j)) δ[x_i = x_j]  (5)\n\nwhere f(i, j) is a function dependent on the output of the data-driven boundary detector P_DB (Equation 2), and α and β are scalar parameters. Since P_DB is a line process with output strength and orientation defined at pixels rather than between pixels, as in the standard contrast dependent pairwise term [3], we must take care to place the pairwise costs consistently along one side of each continuous boundary. For this, let P_i = max_θ P_DB(i, θ) and θ_i = argmax_θ P_DB(i, θ). If i and j are vertical neighbors, with i on top, then f(i, j) = max{0, P_j − P_i}. If i and j are horizontal neighbors, with i on the left, then f(i, j) = max{0, (P_j − P_i)δ[θ_j < π/2], (P_i − P_j)δ[θ_i ≥ π/2]}. Notice that since P_DB is non-negative everywhere, we only incorporate a cost into the model when the difference between adjacent P_DB elements is positive.\n\nWe minimize Equation (3) using graph cuts with alpha-beta swaps [3]. We optimized the parameters on a validation set by manual tuning on the boundary detection task (section 5). We set α = −0.1, β = 0.25, and γ = −0.25. 
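The directional placement of the pairwise strengths f(i, j) can be sketched in NumPy. This is a minimal illustration of the rules stated above for a 4-connected grid (function and array names are ours), assuming the per-pixel maxima P_i and argmax orientations θ_i have already been extracted from P_DB:

```python
import numpy as np

def pairwise_strengths(P, theta):
    # P[i] = max over orientations of P_DB at pixel i; theta[i] = the
    # corresponding argmax orientation in [0, pi). Returns boundary
    # strengths f(i, j) for a 4-connected grid:
    #   vertical neighbors (i on top):  f = max(0, P_j - P_i)
    #   horizontal neighbors (i left):  f depends on the boundary orientation
    f_vert = np.maximum(0.0, P[1:, :] - P[:-1, :])
    Pi, Pj = P[:, :-1], P[:, 1:]
    ti, tj = theta[:, :-1], theta[:, 1:]
    f_horz = np.maximum.reduce([
        np.zeros_like(Pi),
        (Pj - Pi) * (tj < np.pi / 2),
        (Pi - Pj) * (ti >= np.pi / 2),
    ])
    return f_vert, f_horz
```

Each strength then enters the pairwise cost as ψ = (α + β f) δ[x_i = x_j], so a weak boundary (f ≈ 0, with α = −0.1) rewards neighbors for sharing a label, while a strong boundary makes sharing a label costly.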
Note that the number of recovered segments is not necessarily equal to the number of image stack groups K.\n\n5 Experimental evaluation\n\nFigure 5: Evaluation of the boundary detection task on the principal occlusion and contact boundaries extracted from the LabelMe database [17]. We show precision-recall curves for PB [13] (blue triangle line) and our data-driven boundary detector (red circle line). Notice that we achieve improved performance across all recalls. We also show the precision and recall of the output segmentations (green star), which achieve 0.55 precision at 0.09 recall. At the same recall level, PB and the data-driven boundary detector achieve 0.45 and 0.50 precision, respectively.\n\nIn this section, we evaluate the data-driven boundary detector and the proposed image segmentation model on a challenging dataset of complex street scenes from the LabelMe database [19]. For the unlabelled scene database, we use a dataset of 100k street scene images gathered from Flickr [21]. Boundary detection and image grouping are then applied only within this candidate set of images.\n\nFigure 6 shows several final segmentations. Notice that the recovered segments correspond to the large objects depicted in the images, with the segment boundaries aligning along the objects' boundaries. For each segment, we re-query the image stack by using the segment as a weighted mask to retrieve images that match the appearance within the segment. The top matches for each segment are stitched together to form composites, which are shown in Figure 6. As a comparison, we show the top matches using the global descriptor. 
Notice that the composites better align with the contents depicted in the input image.\n\nWe quantitatively evaluate our system by measuring how well we can detect ground truth object boundaries provided by human labelers. To evaluate object boundary detection, we use 100 images depicting street scenes from the benchmark set of the LabelMe database [19]. The benchmark set consists of fully labeled images taken from around the world. A number of different types of edges are implicitly labeled in the LabelMe database, such as those arising through occlusion, attachment, and contact with the ground. For this work, we filter out attached objects (e.g. a window is attached to a building and hence does not generate any object boundaries) using the techniques outlined in [17]. Note that this benchmark is more appropriate for our task than the BSDS [13] since the dataset explicitly contains occlusion boundaries and not interior contours.\n\nTo measure performance, we used the evaluation procedure outlined in [13], which aligns output boundaries for a given threshold to the ground truth boundaries to compute precision and recall. A curve is generated by evaluating at all thresholds. For a boundary to be considered correct, we assume that it must lie within 6 pixels of the ground truth boundary.\n\nFigure 5 shows a precision-recall curve for the data-driven boundary detector. We compare against PB using color [13]. Notice that we achieve higher precision at all recall levels. We also plot the precision and recall of the output segmentation produced by our system. Notice that the segmentation produced the highest precision (0.55) at 0.09 recall. The improvement in performance at low recall is largely due to the ability to suppress interior contours arising from attached objects (c.f. 
Figure 3). However, we tend to miss small, moveable objects, which accounts for the lower performance at high recall.\n\n6 Conclusion\n\nWe have shown that unsupervised analysis of a large image collection can help segment complex scenes into semantically coherent parts. We exploit object variations over related images using MRF-based segmentation that optimizes over matches while preserving scene boundaries obtained by a data driven boundary detection process. We have demonstrated an improved performance in detecting the principal occlusion and contact boundaries over previous methods on a challenging dataset of complex street scenes from LabelMe. Our work also suggests that other applications of scene matching, such as object recognition or computer graphics, might benefit from segment-based explanations of the query scene.\n\nFigure 6: Left: Output segmentation produced by our system. Notice that the segment boundaries align well with the depicted objects in the scene. Top right: Top matches for each recovered segment, which are stitched together to form a composite. Bottom right: Top whole-image matches using the gist descriptor. By recovering the segmentation, we are able to recover improved semantic matches.\n\nAcknowledgments: This work was partially supported by ONR MURI N00014-06-1-0734, ONR MURI N00014-07-1-0182, NGA NEGI-1582-04-0004, NSF grant IIS-0546547, gifts from Microsoft Research and Google, and Guggenheim and Sloan fellowships.\n\nReferences\n\n[1] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen. Interactive digital photomontage. In SIGGRAPH, 2004.\n\n[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.\n\n[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11), 2001.\n\n[4] J. 
F. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.\n\n[5] S. K. Divvala, A. A. Efros, and M. Hebert. Can similar scenes help surface layout estimation? In IEEE Workshop on Internet Vision, associated with CVPR, 2008.\n\n[6] J. Hays and A. Efros. Scene completion using millions of photographs. In SIGGRAPH, 2007.\n\n[7] J. Hays and A. A. Efros. IM2GPS: estimating geographic information from a single image. In CVPR, 2008.\n\n[8] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 43:177–196, 2001.\n\n[9] M. K. Johnson, K. Dale, S. Avidan, H. Pfister, W. T. Freeman, and W. Matusik. CG2Real: Improving the realism of computer-generated images using a large collection of photographs. Technical Report 2009-034, MIT CSAIL, 2009.\n\n[10] H. Kang, A. A. Efros, M. Hebert, and T. Kanade. Image composition for object pop-out. In IEEE Workshop on 3D Representation for Recognition (3dRR-09), in assoc. with CVPR, 2009.\n\n[11] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via dense scene alignment. In CVPR, 2009.\n\n[12] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT flow: dense correspondence across different scenes. In ECCV, 2008.\n\n[13] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004.\n\n[14] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.\n\n[15] T. Quack, B. Leibe, and L. Van Gool. World-scale mining of objects and events from community photo collections. In CIVR, 2008.\n\n[16] C. Rother, V. Kolmogorov, T. Minka, and A. Blake. 
Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs. In CVPR, 2006.\n\n[17] B. C. Russell and A. Torralba. Building a database of 3D scenes from user annotations. In CVPR, 2009.\n\n[18] B. C. Russell, A. Torralba, C. Liu, R. Fergus, and W. T. Freeman. Object recognition by scene alignment. In Advances in Neural Info. Proc. Systems, 2007.\n\n[19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77(1-3):157–173, 2008.\n\n[20] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In CVPR, 2007.\n\n[21] J. Sivic, B. Kaneva, A. Torralba, S. Avidan, and W. T. Freeman. Creating and exploring a large photorealistic virtual space. In First IEEE Workshop on Internet Vision, associated with CVPR, 2008.\n\n[22] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.\n\n[23] O. Whyte, J. Sivic, and A. Zisserman. Get out of my picture! Internet-based inpainting. In British Machine Vision Conference, 2009.\n", "award": [], "sourceid": 1002, "authors": [{"given_name": "Bryan", "family_name": "Russell", "institution": null}, {"given_name": "Alyosha", "family_name": "Efros", "institution": null}, {"given_name": "Josef", "family_name": "Sivic", "institution": null}, {"given_name": "Bill", "family_name": "Freeman", "institution": null}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": null}]}