{"title": "Mid-level Visual Element Discovery as Discriminative Mode Seeking", "book": "Advances in Neural Information Processing Systems", "page_first": 494, "page_last": 502, "abstract": "Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical \"visual words\", but lower than full-blown semantic objects. Several approaches have been proposed to discover mid-level visual elements that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset.", "full_text": "Mid-level Visual Element Discovery as Discriminative Mode Seeking\n\nCarl Doersch\nCarnegie Mellon University\ncdoersch@cs.cmu.edu\n\nAbhinav Gupta\nCarnegie Mellon University\nabhinavg@cs.cmu.edu\n\nAlexei A. Efros\nUC Berkeley\nefros@cs.berkeley.edu\n\nAbstract\n\nRecent work on mid-level visual representations aims to capture information at the level of complexity higher than typical \u201cvisual words\u201d, but lower than full-blown semantic objects. 
Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset.\n\n1 Introduction\nIn terms of sheer size, visual data is, by most accounts, the biggest \u201cBig Data\u201d out there. But, unfortunately, most machine learning algorithms (with some notable exceptions, e.g. [13]) are not equipped to handle it directly, at the raw pixel level, making research on finding good visual representations particularly relevant and timely. Currently, the most popular visual representations in machine learning are based on \u201cvisual words\u201d [24], which are obtained by unsupervised clustering (k-means) of local features (SIFT) over a large dataset. However, \u201cvisual words\u201d is a very low-level representation, mostly capturing local edges and corners ([21] notes that \u201cvisual letters\u201d or \u201cvisual phonemes\u201d would have been a more accurate term). 
Part of the problem is that the local SIFT features are relatively low-dimensional (128D), and might not be powerful enough to capture anything of higher complexity. However, switching to a more descriptive feature (e.g., 2,000-dimensional HOG) causes k-means to produce visually poor clusters due to the curse of dimensionality [5].\nRecently, several approaches [5, 6, 11, 12, 15, 23, 26, 27] have proposed mining visual data for discriminative mid-level visual elements, i.e., entities which are more informative than \u201cvisual words,\u201d and more frequently occurring and easier to detect than high-level objects. Most such approaches require some form of weak per-image labels, e.g., scene categories [12] or GPS coordinates [5] (but can also run unsupervised [23]), and have been recently used for tasks including image classification [12, 23, 27], object detection [6], visual data mining [5, 15], action recognition [11], and geometry estimation [7]. But how are informative visual elements to be identified in the weakly-labeled visual dataset? The idea is to search for clusters of image patches that are both 1) representative, i.e., frequently occurring within the dataset, and 2) visually discriminative. Unfortunately, algorithms for finding patches that fit these criteria remain rather ad-hoc and poorly understood, and often do not even directly optimize these criteria. Hence, our goal in this work is to quantify the terms \u201crepresentative\u201d and \u201cdiscriminative,\u201d and show that a formulation which draws inspiration from the well-known, well-understood mean-shift algorithm can produce visual elements that are more representative and discriminative than those of previous approaches.\n\nFigure 1: The distribution of patches in HOG feature space is very non-uniform and absolute distances cannot be trusted. We show two patches with their 5 nearest-neighbors from the Paris Street View dataset [5]; beneath each nearest neighbor is its distance from query. Although the nearest neighbors on the left are visually much better, their distances are more than twice those on the right, meaning that the actual densities of the two regions will differ by a factor of more than $2^d$, where $d$ is the intrinsic dimensionality of patch feature space. Since this is a 2112-dimensional feature space, we estimate $d$ to be on the order of hundreds.\n\nMining visual elements from a large dataset is difficult for a number of reasons. First, the search space is huge: a typical dataset for visual data mining has tens of thousands of images, and finding something in an image (e.g., finding matches for a visual template) involves searching across tens of thousands of patches at different positions and scales. To make matters worse, patch descriptors tend to be on the order of thousands of dimensions; not only is the curse of dimensionality a constant problem, but we must sift through terabytes of data. And we are searching for a needle in a haystack: the vast majority of patches are actually uninteresting, either because they are rare (e.g., they may contain multiple random things in a configuration that never occurs again) or they are redundant due to the overlapping nature of patches. This suggests the need for an online algorithm, because we wish to discard much of the data while making as few passes through the dataset as possible.\nThe well-known mean-shift algorithm [2, 3, 8] has been proposed to address many of these problems. The goal of mean-shift is to find the local maxima (modes) of a density using a sample from that density. 
Intuitively, mean-shift initializes each cluster centroid to a single data point, then iteratively 1) finds data points that are sufficiently similar to each centroid, and 2) averages these data points to update the cluster centroid. In the end, each cluster generally depends on only a tiny fraction of the data, thus eliminating the need to keep the entire dataset in memory.\nHowever, there is one issue with using classical mean-shift to solve our problem directly: it only finds local maxima of a single, unlabeled density, which may not be discriminative. But in our case, we can use the weak labels to divide our data into two different subsets (\u201cpositive\u201d ($+$) and \u201cnegative\u201d ($-$)) and seek visual elements which appear only in the \u201cpositive\u201d set and not in the \u201cnegative\u201d set. That is, we want to find points in feature space where the density of the positive set is large, and the density of the negative set is small. This can be achieved by maximizing the well-studied density ratio $p^+(x)/p^-(x)$ instead of maximizing the density. While a number of algorithms exist for estimating ratios of densities (see [25] for a review), we did not find any that were particularly suitable for finding local maxima of density ratios. Hence, the first contribution of our paper is to propose a discriminative variant of mean-shift for finding visual elements. Similar to the way mean-shift performs gradient ascent on a density estimate, our algorithm performs gradient ascent on the density ratio (section 2). When we perform gradient ascent separately for each element as in standard mean-shift, however, we find that the most frequently-occurring elements tend to be over-represented. 
Hence, section 3 describes a modification to our gradient ascent algorithm which uses inter-element communication to approximate common adaptive bandwidth procedures. Finally, in section 4 we demonstrate that our algorithms produce visual elements which are more representative and discriminative than previous methods, and in section 5 we show they significantly improve performance in scene classification.\n\n2 Mode Seeking on Density Ratios\nOur goal is to extract discriminative visual elements by finding the local maxima of the density ratio. However, one issue with performing gradient ascent directly on standard density ratio estimates is that common estimators tend to use a fixed kernel bandwidth, for example:\n\n$$\\hat{r}(x) \\propto \\sum_{i=1}^{n} \\theta_i K(\\|x - x_i\\| / h)$$\n\nwhere $\\hat{r}$ is the ratio estimate, the parameters $\\theta_i \\in \\mathbb{R}$ are weights associated with each datapoint, $K$ is a kernel function (e.g., a Gaussian), and $h$ is a globally-shared bandwidth parameter. The bandwidth defines how much the density is smoothed before gradient ascent is performed, meaning these estimators assume a roughly equal distribution of points in all regions of the space. Unfortunately, absolute distances in HOG feature space cannot be trusted, as shown in Figure 1: any kernel bandwidth which is large enough to work well in the left example will be far too large to work well in the right. One way to deal with the non-uniformity of the feature space is to use an adaptive bandwidth [4]: that is, different bandwidths are used in different regions of the space. However, previous algorithms are difficult to implement for large data in high-dimensional spaces; [4], for instance, requires a density estimate for every point used in computing the gradient of their objective, because their formulation relies on a per-point bandwidth rather than a per-cluster bandwidth. 
In our case, this is prohibitively expensive. While approximations exist [9], they rely on approximate nearest neighbor algorithms, which work for low-dimensional spaces ($\\le 48$ dimensions in [9]), but empirically we have found poor performance in HOG feature space ($> 2000$ dimensions). Hence, we take a different approach which we have tailored for density ratios.\nWe begin by using a result from [2] that classical mean-shift (using a flat kernel) is equivalent to finding the local maxima of the following density estimate:\n\n$$\\frac{\\sum_{i=1}^{n} \\max(b - d(x_i, w), 0)}{z(b)} \\tag{1}$$\n\nIn standard mean-shift, $d$ is the Euclidean distance function, $b$ is a constant that controls the kernel bandwidth, and $z(b)$ is a normalization constant. Here, the flat kernel has been replaced by its shadow kernel, the triangular kernel, using Theorem 1 from [2]. We want to maximize the density ratio, so we simply divide the two density estimates. We allow an adaptive bandwidth, but rather than associating a bandwidth with each datapoint, we compute it as a function of $w$ which depends on the data.\n\n$$\\frac{\\sum_{i=1}^{n_{pos}} \\max(B(w) - d(x_i^+, w), 0)}{\\sum_{i=1}^{n_{neg}} \\max(B(w) - d(x_i^-, w), 0)} \\tag{2}$$\n\nWhere the normalization term $z(b)$ is cancelled. This expression, however, produces poor estimates of the ratio if the denominator is allowed to shrink to zero; in fact, it can produce arbitrarily large but spurious local maxima. Hence, we define $B(w)$ as the value of $b$ which satisfies:\n\n$$\\sum_{i=1}^{n_{neg}} \\max(b - d(x_i^-, w), 0) = \\beta \\tag{3}$$\n\nWhere $\\beta$ is a constant analogous to the bandwidth parameter, except that it directly controls how many negative datapoints are in each cluster. Note the value of the sum is strictly increasing in $b$ when it is nonzero, so the $b$ satisfying the constraint is unique. 
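As a rough illustration of the constraint above, the following sketch solves for b given the distances from w to the negative datapoints and a target value of beta. It relies only on the fact stated in the text that the hinge sum is strictly increasing in b wherever it is nonzero. The function name solve_bandwidth and the sort-and-scan strategy are assumptions made for illustration and are not taken from the authors' implementation.

```python
def solve_bandwidth(neg_dists, beta):
    # Find b with sum(max(b - d, 0) for d in neg_dists) == beta, for beta > 0.
    dists = sorted(neg_dists)
    prefix = 0.0
    for k, d in enumerate(dists, start=1):
        prefix += d
        # If exactly the k nearest negatives are active, the sum is k*b - prefix.
        b = (beta + prefix) / k
        upper = dists[k] if k < len(dists) else float('inf')
        if b <= upper:
            return b
    raise ValueError('need beta > 0 and at least one negative distance')


dists = [1.0, 2.0, 4.0]
b = solve_bandwidth(dists, beta=1.5)
hinge_sum = sum(max(b - d, 0.0) for d in dists)
```

In this toy call the two nearest negatives end up inside the cluster and the third stays outside, which is the sense in which beta controls how many negative datapoints each cluster admits.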
With this definition of $B(w)$, we are actually fixing the value of the denominator of (2) (we include the denominator here only to make the ratio explicit, and we will drop it in later formulas). This approach makes the implicit assumption that the distribution of the negatives captures the overall density of the patch space. Note that if we assume the denominator distribution is uniform, then $B(w)$ becomes fixed and our objective is identical to fixed-bandwidth mean-shift.\nReturning to our formulation, we must still choose the distance function $d$. In high-dimensional feature space, [20] suggests that normalized correlation provides a better metric than the Euclidean distance commonly used in mean-shift. Formulations of mean-shift exist for data constrained to the unit sphere [1], but again we must adapt them to the ratio setting. Surprisingly, replacing the Euclidean distance with normalized correlation leads to a simpler optimization problem. First, we mean-subtract and normalize all datapoints $x_i$ and rewrite (2) as:\n\n$$\\sum_{i=1}^{n_{pos}} \\max(w^\\top x_i^+ - b, 0) \\quad \\text{s.t.} \\quad \\sum_{i=1}^{n_{neg}} \\max(w^\\top x_i^- - b, 0) = \\beta, \\quad \\|w\\|^2 = 1 \\tag{4}$$\n\nWhere $B(w)$ has been replaced by $b$ as in equation (3), to emphasize that we can treat $B(w)$ as a constraint in an optimization problem. We can further rewrite the above equation as finding the local maxima of:\n\n$$\\sum_{i=1}^{n_{pos}} \\max(w^\\top x_i^+ - b, 0) - \\lambda \\|w\\|^2 \\quad \\text{s.t.} \\quad \\sum_{i=1}^{n_{neg}} \\max(w^\\top x_i^- - b, 0) = \\beta \\tag{5}$$\n\n[Figure 2 panels, each shown at: initial, first iteration, final iteration]\nFigure 2: Left: without competition, the algorithm from section 2 correctly learns a street lamp element. Middle: The same algorithm trained on a sidewalk barrier, which is too similar to the very common \u201cwindow with railing\u201d element, which takes over the cluster. 
Right: with the algorithm from section 3, the window gets down-weighted and the algorithm can learn the sidewalk barrier.\nNote that (5) is equivalent to (4) for some appropriate rescaling of $\\lambda$ and $\\beta$. It can be easily shown that multiplying $\\lambda$ by a constant factor does not change the relative location of local maxima, as long as we divide $\\beta$ by that same factor. Such a re-scaling will in fact result in re-scaling $w$ by the same value, so we can choose a $\\lambda$ and $\\beta$ which makes the norm of $w$ equal to 1 (see footnote 1).\nAfter this rewriting, we are left with an objective that looks curiously like a margin-based method. Indeed, the negative set is treated very much like the negative set in an SVM (we penalize the linear sum of the margin violations), which follows [23]. However, unlike [23], which makes the ad-hoc choice of 5 positive examples, our algorithm allows each cluster to select the optimal number of positives based on the decision boundary. This is somewhat reminiscent of unsupervised margin-based clustering [29, 16].\nMean-shift prescribes that we initialize the procedure outlined above at every datapoint. In our setting, however, this is not practical, so we instead use a randomly-sampled subset. We run this as an online algorithm by breaking the dataset into chunks and then mining, one chunk at a time, for patches where $w^\\top x - b > -\\epsilon$ for some small $\\epsilon$, akin to \u201chard mining\u201d for SVMs. We perform gradient ascent after each mining phase. An example result for this algorithm is shown in Figure 2, and we include further results below. Gradient ascent on our objective is surprisingly efficient, as described in Appendix A.\n\n3 Better Adaptive Bandwidth via Inter-Element Communication\nImplicit in our formulation thus far is the idea that we do not want a single mode, but instead many distinct modes, each of which corresponds to a different element. 
In theory, mode-seeking will find every mode that is supported by the data. In practice, clusters often drift from weak modes to stronger modes, as demonstrated in Figure 2 (middle). One way to deal with this is to assign smaller bandwidths to patches in dense regions of the space [4], e.g., the window railing on row 1 of Figure 2 (middle) would hopefully have a smaller bandwidth and hence not match to the sidewalk barrier. However, estimating a bandwidth for every datapoint in our setting is not practical, so we seek an approach which only requires one pass through the data. Since patches in regions of the feature space with high density ratio will be members of many clusters, we want a mechanism that will reduce their bandwidth. To accomplish this, we extend the standard local (per-element) optimization of mean-shift into a joint optimization among the $m$ different element clusters. Specifically, we control how a single patch can contribute to multiple clusters by introducing a sharing weight $\\alpha_{i,j}$ for each patch $i$ that is contained in a cluster $j$, akin to soft-assignment in EM GMM fitting. Returning to our formulation, we maximize (again with respect to the $w$\u2019s and $b$\u2019s):\n\n$$\\sum_{i=1}^{n_{pos}} \\sum_{j=1}^{m} \\alpha_{i,j} \\max(w_j^\\top x_i^+ - b_j, 0) - \\lambda \\sum_{j=1}^{m} \\|w_j\\|^2 \\quad \\text{s.t.} \\quad \\forall j \\; \\sum_{i=1}^{n_{neg}} \\max(w_j^\\top x_i^- - b_j, 0) = \\beta \\tag{6}$$\n\nWhere each $\\alpha_{i,j}$ is chosen such that any patch which is a member of multiple clusters gets a lower weight. (6) also has a natural interpretation in terms of maximizing the \u201crepresentativeness\u201d of the set of clusters: clusters are rewarded for representing patches that are not represented by other clusters. But how can we set the $\\alpha$\u2019s? 
One way is to set $\\alpha_{i,j} = \\max(w_j^\\top x_i^+ - b_j, 0) / \\sum_{k=1}^{m} \\max(w_k^\\top x_i^+ - b_k, 0)$, and alternate between setting the $\\alpha$\u2019s and optimizing the $w$\u2019s and $b$\u2019s at each iteration. Intuitively, this algorithm would be much like EM, alternating between softly assigning cluster memberships for each datapoint and then optimizing each cluster. However, this goes against our mean-shift intuition: if two patches are really instances of the same element, then clusters initialized from those two points should converge to the same mode and not \u201ccompete\u201d with one another.\n\nFootnote 1: Admittedly this means that the norm of $w$ has an indirect effect on the underlying bandwidth: specifically, if the norm of $w$ is increased, it has a similar effect as a proportional decrease in $\\beta$ in (4). However, since $w$ is roughly proportional to the density of the positive data, the bandwidth is only reduced when the density of positive data is high.\n\n[Figure 3 panels: 25 Elements (left), 200 Elements (right); x-axis: Coverage (Fraction of Positive Dataset); y-axis: Purity; legend: This work; This work, no inter-element; SVM Retrained 5x (Doersch et al. 2012); LDA Retrained 5x; LDA Retrained; Exemplar LDA (Hariharan et al. 2012)]\nFigure 3: Purity-coverage graph for our algorithm and baselines. In each plot, purity measures the accuracy of the element detectors, whereas coverage captures how often they fire. Curves are computed over the top 25 (left) and 200 (right) elements. Higher is better.\n\n
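The EM-like soft assignment discussed above, before any clustering of elements is taken into account, can be sketched for a single positive patch as follows. This is a minimal illustration under the assumption that the detector scores w_j . x_i - b_j have already been computed; the name sharing_weights is hypothetical and the zero-score fallback is a choice made here, not something specified in the text.

```python
def sharing_weights(scores):
    # scores[j] = w_j . x_i - b_j for one positive patch i and each element j.
    hinge = [max(s, 0.0) for s in scores]
    total = sum(hinge)
    if total == 0.0:
        # Patch is outside every cluster, so it contributes to none of them.
        return [0.0] * len(scores)
    # Each cluster's share of the patch is proportional to its hinge score.
    return [h / total for h in hinge]


weights = sharing_weights([2.0, 6.0, -1.0])
```

A patch claimed by several elements has its weight split among them, which is the mechanism that lowers the contribution of patches in dense, frequently matched regions.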
So, our heuristic is to first cluster the elements. Let $C_j$ be the assigned cluster for the $j$\u2019th element. Then we set\n\n$$\\alpha_{i,j} = \\frac{\\max(w_j^\\top x_i^+ - b_j, 0)}{\\max(w_j^\\top x_i^+ - b_j, 0) + \\sum_{k=1}^{m} I(C_k \\neq C_j) \\max(w_k^\\top x_i^+ - b_k, 0)} \\tag{7}$$\n\nIn this way, any \u201ccompetition\u201d from elements that are too similar to each other is ignored. To obtain the clusters, we perform agglomerative (UPGMA) clustering on the set of element clusters, using the negative of the number of overlapping cluster members as a \u201cdistance\u201d metric.\nIn practice, however, it is extremely rare that the exact same patch is a member of two different clusters; instead, clusters will have member patches that merely overlap with each other. Our heuristic to deal with this is to compute a quantity $\\alpha'_{i,j,p}$ which is analogous to the $\\alpha_{i,j}$ defined above, but is defined for every pixel $p$. Then we compute $\\alpha_{i,j}$ for a given patch by averaging $\\alpha'_{i,j,p}$ over all pixels in the patch. Specifically, we compute $\\alpha_{i,j}$ for patch $i$ as the mean over all pixels $p$ in that patch of the following quantity:\n\n$$\\alpha'_{i,j,p} = \\frac{\\max(w_j^\\top x_i^+ - b_j, 0)}{\\max(w_j^\\top x_i^+ - b_j, 0) + \\sum_{x \\in Ov(p)} \\sum_{k=1}^{m} I(C_k \\neq C_j) \\max(w_k^\\top x_i^+ - b_k, 0)} \\tag{8}$$\n\nWhere $Ov(p)$ denotes the set of features for positive patches that contain the pixel $p$.\nIt is admittedly difficult to analyze how well these heuristics approximate the adaptive bandwidth approach of [4], and even there the setting of the bandwidth for each datapoint has heuristic aspects. However, empirically our approach leads to improvements in performance as discussed below, and suggests a potential area for future work.\n\n4 Evaluation via Purity-Coverage Plot\nOur aim is to discover visual elements that are maximally representative and discriminative. 
To measure this, we define two quantities for a set of visual elements: coverage (which captures representativeness) and purity (which captures discriminativeness). Given a held-out test set, visual elements will generate a set of patch detections. We define the coverage of this set of patches to be the fraction of the pixels from the positive images claimed by at least one patch. We define the purity of a set as the percentage of the patches that share the same label. For an individual visual element, of course, there is an inherent trade-off between purity and coverage: if we lower the detection threshold, we cover more pixels but also increase the likelihood of making mistakes. Hence, we can construct a purity-coverage curve for a set of elements, analogous to a precision-recall curve. We could perform this analysis on any dataset containing positive and negative images, but [5] presents a dataset which is particularly suitable. The goal is to mine visual elements which define the look and feel of a geographical locale, with a training set of 2,000 Paris Street View images and 8,000 non-Paris images, as well as 2,999 of both classes for testing. Purity-coverage curves for this dataset are shown in Figure 3.\n\n[Figure 4 panels: Purity of 100% (left), Purity of 90% (right); x-axis: Number of Elements; y-axis: Coverage (Fraction of Positive Dataset); legend: This work; This work, no inter-element; SVM Retrained 5x (Doersch et al. 2012); LDA Retrained 5x; LDA Retrained; Exemplar LDA (Hariharan et al. 2012)]\nFigure 4: Coverage versus the number of elements used in the representation. On the left we keep only the detections with a score higher than the score of the detector\u2019s first error (i.e. purity 1). On the right, we lower the detection threshold until the elements are 90% pure. Note: this is the same purity and coverage measure for the same elements as Figure 3, just plotted differently.\n\nTo plot the curve for a given value of purity $p$, we rank all patches by $w^\\top x - b$ independently for every element, and select, for a given element, all patches up until the last point where the element has the desired purity. We then compute the coverage as the union of patches selected for every element. Because we are taking a union of patches, adding more elements can only increase coverage, but in practice we prefer concise representations, both for interpretability and for computational reasons. Hence, to compare two element discovery methods, we must select exactly the same number of elements for both of them. Different works have proposed different heuristics for selecting elements, which would make the resulting curves incomparable. Hence, we select elements in the same way for all algorithms, which approximates an \u201cideal\u201d selection for our measure. Specifically, we first fix a level of purity (95%) and greedily select elements to maximize coverage (on the testing data) for that level of purity. Hence, this ranking serves as an oracle to choose the \u201cbest\u201d set of elements for covering the dataset at that level of purity. While this ranking has a bias toward large elements (which inherently cover more pixels per detection), we believe that it provides a valuable comparison between algorithms. Our purity-coverage curves are shown in Figure 3, for the 25 and 200 top elements, respectively. 
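One point on a purity-coverage curve, as described above, could be computed along the following lines. This is a simplified sketch under stated assumptions: each detection is a hypothetical tuple (score, is_positive, pixels), pixels are represented as a set of integer ids, and only pixels from positive-image detections count toward coverage. The helper names select_at_purity and coverage are illustrative and not from the paper.

```python
def select_at_purity(detections, purity):
    # Longest top-ranked prefix whose fraction of positive detections >= purity.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    best, positives = 0, 0
    for n, (_, is_pos, _) in enumerate(ranked, start=1):
        positives += int(is_pos)
        if positives / n >= purity:
            best = n  # keep the LAST rank at which the purity target holds
    return ranked[:best]


def coverage(elements, purity, total_positive_pixels):
    # Union of positive-image pixels claimed by the selected detections.
    covered = set()
    for dets in elements:
        for _, is_pos, pixels in select_at_purity(dets, purity):
            if is_pos:
                covered |= pixels
    return len(covered) / total_positive_pixels


element_a = [(0.9, True, {1, 2}), (0.8, False, {7}), (0.7, True, {2, 3}), (0.1, False, {8})]
element_b = [(0.5, True, {3, 4})]
cov_strict = coverage([element_a], purity=1.0, total_positive_pixels=10)
cov_loose = coverage([element_a], purity=0.6, total_positive_pixels=10)
cov_both = coverage([element_a, element_b], purity=0.6, total_positive_pixels=10)
```

Because coverage is a union, adding element_b can only increase it, while lowering the purity target for element_a admits its third detection at the cost of one mistake.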
We can also slice the same data differently, fixing a level of purity for all elements and varying the number of elements, as shown in Figure 4.\nBaselines: We included five baselines of increasing complexity. Our goal is not only to analyze our own algorithm; we want to show the importance of the various components of previous algorithms as well. We initially train 20,000 visual elements for all the baselines, and select the top elements using the method above. The simplest baseline is \u201cExemplar LDA,\u201d proposed by [10]. Each cluster is represented by a hyperplane which maximally separates a single seed patch from the negative dataset learned via LDA, i.e. the negative distribution is approximated using a single multivariate Gaussian. To show the effects of re-clustering, \u201cLDA Retrained\u201d takes the top 5 positive-set patches retrieved in Exemplar LDA (including the initial patch itself), and repeats LDA, separating those 5 from the negative Gaussian. This is much like the well-established method of \u201cquery expansion\u201d for retrieval, and is similar to [12] (although they use multiple iterations of query expansion). Next, \u201cLDA Retrained 5 times\u201d begins with elements initialized via the LDA retraining method, and retrains the LDA classifier, each time throwing out the previous top 5 used to train the previous LDA, and selecting a new top 5 from held-out data. This is much like the iterative SVM training of [5], except that it uses LDA instead of an SVM. Finally, we include the algorithm of [5], which is a weakly supervised version of [23], except that k-NN is used for initialization instead of k-means. The iterations of retraining clearly improve performance, and it seems that replacing LDA with an SVM also gives improvement, especially for difficult elements.\nImplementation details: We use the same patch descriptors described in [5] and whiten them following [10]. 
We mine elements using the online version of our algorithm, with a chunk size of 1000 (200 Paris, 800 non-Paris per batch). We set $\\beta \\cdot \\lambda = t/500$ where $t$ is the iteration number, such that the bandwidth increases proportional to the number of samples. We train the elements for about 200 gradient steps after each chunk of mining. To compute $\\alpha_{i,j}$ for patch $i$ and detector $j$, we actually use scale-space voxels rather than pixels, since a large detection can completely cover a small detection but not vice versa. Hence, the set of scale-space voxels covered is a 3D box, the width of the bounding box by its height (both discretized by a factor of 8 for efficiency) by 5, covering exactly one \u201coctave\u201d of scale space (i.e. $\\log_2(\\sqrt{\\text{width} \\cdot \\text{height}}) \\cdot 5$ through $\\log_2(\\sqrt{\\text{width} \\cdot \\text{height}}) \\cdot 5 + 4$). For experiments without inter-element communication, we simply set $\\alpha_{i,j}$ to 0.1. Finally, to reduce the impact of highly redundant textures, we divide $\\alpha_{i,j}$ by the total number of detections for element $j$ in the image containing $i$. Source code will be available online.\n\nFigure 5: For each correctly classified image (left), we show four elements (center) and heatmap of the locations (right) that contributed most to the classification.\n\nTable 1: Results on MIT 67 scenes (method: accuracy)\nROI + Gist [19]: 26.05\nMM-scene [30]: 28.00\nDPM [17]: 30.40\nCENTRIST [28]: 36.90\nObject Bank [14]: 37.60\nRBoW [18]: 37.93\nD-Patches [23]: 38.10\nLPR [22]: 44.84\nBoP [12]: 46.10\nmiSVM [15]: 46.40\nD-Patches (full) [23]: 49.40\nMMDL [27]: 50.15\nD-Parts [26]: 51.40\nIFV [12]: 60.77\nBoP+IFV [12]: 63.10\nOurs (no inter-element, \u00a72): 63.36\nOurs (\u00a73): 64.03\nOurs+IFV: 66.87\n\n5 Scene Classification\nFinally, we evaluate whether our visual element representation is useful for scene classification. 
We use the MIT Scene-67 dataset [19], where machine performance remains substantially below human performance. For indoor scenes, objects within the scene are often more useful features than global scene statistics [12]: for instance, shoe shops are similar to other stores in global layout, but they mostly contain shoes.\n\nFigure 6: Each of these images was misclassified by the algorithm, and the heatmaps explain why. For instance, it may not be obvious why a corridor would be classified as a staircase, but we can see (top right) that the algorithm has identified the railings as a key staircase element, and has found no other staircase elements in the image.\n\nImplementation details: We used the original Indoor-67 train/test splits (80 training and 20 testing images per class). We learned 1600 elements per class, for a total of 107,200 elements, following the procedure described above. We include right-left flipped images as extra positives. 5 batches were sufficient, as this dataset is smaller. We also used smaller descriptors: 6-by-6 HOG cells, corresponding to 64-by-64 patches and 1188-dimensional descriptors. We again select elements by fixing purity and greedily selecting elements to maximize coverage, as above. However, rather than defining coverage as the number of pixels (which is biased toward larger elements), we simply count the detections, penalizing for overlap: we penalize each individual detection by a factor of $1/(1 + n_{overlap})$, where $n_{overlap}$ is the number of detections from previously selected detectors that a given detection overlaps with. We select 200 top elements per class. To construct our final feature vector, we use a 2-level (1x1 and 2x2) spatial pyramid and take the max score per detector per region, thresholded at .5 (since below this value we do not expect the detection scores to be meaningful), resulting in a 67,000-dimensional vector. 
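The dimensionality quoted above follows from 200 elements for each of 67 classes, pooled over 5 pyramid regions (one whole-image cell plus four quadrants). The sketch below illustrates such max pooling under several assumptions that the text does not specify: detections are hypothetical tuples (detector_id, x, y, score), a detection is assigned to a quadrant by its centre, and thresholding is interpreted as clamping scores from below at a floor value.

```python
def pyramid_features(detections, num_detectors, width, height, floor=0.5):
    # Max score per detector per region for a 1x1 + 2x2 spatial pyramid.
    regions = 5  # region 0 is the whole image, regions 1..4 are the quadrants
    feats = [floor] * (num_detectors * regions)
    for det, x, y, score in detections:
        quadrant = 1 + (2 if y >= height / 2 else 0) + (1 if x >= width / 2 else 0)
        for region in (0, quadrant):
            idx = det * regions + region
            feats[idx] = max(feats[idx], score)  # scores below floor are ignored
    return feats


# 200 elements per class, 67 classes, 5 regions per element.
full_dimension = 200 * 67 * 5
toy = pyramid_features([(0, 10, 10, 0.9), (1, 90, 90, 0.2)],
                       num_detectors=2, width=100, height=100)
```

In the toy call, detector 0 fires strongly in the top-left quadrant, so its whole-image and top-left entries rise to 0.9, while the weak detection from detector 1 stays clamped at the floor.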
We average the feature vector for the right and left flips of the image, and classify using 67 one-vs-all linear SVMs. Note that this differs from [23], which selects only the elements for a given class in each class-specific SVM.\nFigure 5 shows a few qualitative results of our algorithm. Quantitative results and comparisons are shown in Table 1. We significantly outperform other methods based on discriminative patches, suggesting that our training method is useful. We even outperform the Improved Fisher Vector of [12], as well as IFV combined with discriminative patches (IFV+BoP). Finally, although the optimally-performing representation is dense (about 58% of features are nonzero), it can be made much sparser without sacrificing much performance. For instance, if we trivially zero-out low-valued features until fewer than 6% are nonzero, we still achieve 60.45% accuracy.\n\n6 Conclusion\nWe developed an extension of the classic mean-shift algorithm to density ratio estimation, showing that the resulting algorithm could be used for element discovery, and demonstrating state-of-the-art results for scene classification. However, there is still much room for improvement in weakly-supervised element discovery algorithms. For instance, our algorithm is limited to binary labels, but image labels may be continuous (e.g., GPS coordinates or dates). Also, our elements are detected based only on individual patches, but images often contain global structures beyond patches.\nAcknowledgements: We thank Abhinav Shrivastava, Yong Jae Lee, Supreeth Achar, and Geoff Gordon for helpful insights and discussions. This work was partially supported by an NDSEG fellowship to CD, an Amazon Web Services grant, a Google Research grant, ONR MURI N000141010934, and IARPA via Air Force Research Laboratory. The U.S. 
Government is\nauthorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon.\nDisclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily\nrepresenting the official policies or endorsements, either expressed or implied, of IARPA, AFRL or the U.S. Government.\n\nFigure 6 panels (ground truth / guess): deli / grocery store; laundromat / closet; corridor / staircase; museum / garage; office / classroom; bakery / buffet.\n\nReferences\n\n[1] H. E. Cetingul and R. Vidal. Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds. In CVPR, 2009.\n[2] Y. Cheng. Mean shift, mode seeking, and clustering. PAMI, 17(8):790\u2013799, 1995.\n[3] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In CVPR, 2000.\n[4] D. Comaniciu, V. Ramesh, and P. Meer. The variable bandwidth mean shift and data-driven scale selection. In ICCV, 2001.\n[5] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.\n[6] I. Endres, K. Shih, J. Jiaa, and D. Hoiem. Learning collections of part models for object recognition. In CVPR, 2013.\n[7] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.\n[8] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, 1975.\n[9] B. Georgescu, I. Shimshoni, and P. Meer. Mean shift based clustering in high dimensions: A texture classification example. In CVPR, 2003.\n[10] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.\n[11] A. Jain, A. Gupta, M. Rodriguez, and L. Davis. Representing videos using mid-level discriminative patches. 
In CVPR, 2013.\n[12] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In CVPR, 2013.\n[13] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.\n[14] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.\n[15] Q. Li, J. Wu, and Z. Tu. Harvesting mid-level visual concepts from large-scale internet images. In CVPR, 2013.\n[16] T. Malisiewicz and A. A. Efros. Recognition by association via learning per-exemplar distances. In CVPR, 2008.\n[17] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011.\n[18] S. N. Parizi, J. G. Oberlin, and P. F. Felzenszwalb. Reconfigurable models for scene recognition. In CVPR, 2012.\n[19] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.\n[20] M. Radovanovi\u0107, A. Nanopoulos, and M. Ivanovi\u0107. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. In ICML, 2009.\n[21] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.\n[22] F. Sadeghi and M. F. Tappen. Latent pyramidal regions for recognizing scenes. In ECCV, 2012.\n[23] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.\n[24] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.\n[25] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio estimation: A comprehensive review. RIMS Kokyuroku, 2010.\n[26] J. Sun and J. Ponce. 
Learning discriminative part detectors for image classification and cosegmentation. In ICCV, 2013.\n[27] X. Wang, B. Wang, X. Bai, W. Liu, and Z. Tu. Max-margin multiple-instance dictionary learning. In ICML, 2013.\n[28] J. Wu and J. M. Rehg. Centrist: A visual descriptor for scene categorization. PAMI, 2011.\n[29] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, 2004.\n[30] J. Zhu, L.-J. Li, L. Fei-Fei, and E. P. Xing. Large margin learning of upstream scene understanding models. In NIPS, 2010.", "award": [], "sourceid": 335, "authors": [{"given_name": "Carl", "family_name": "Doersch", "institution": "CMU"}, {"given_name": "Abhinav", "family_name": "Gupta", "institution": "CMU"}, {"given_name": "Alexei", "family_name": "Efros", "institution": "UC Berkeley"}]}