{"title": "Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships", "book": "Advances in Neural Information Processing Systems", "page_first": 1222, "page_last": 1230, "abstract": "The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the objects relationship to other elements of the scene (context).  Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context.  We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralbas proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.", "full_text": "Beyond Categories: The Visual Memex Model for\n\nReasoning About Object Relationships\n\nTomasz Malisiewicz, Alexei A. Efros\n\nRobotics Institute\n\nCarnegie Mellon University\n\n{tmalisie,efros}@cs.cmu.edu\n\nAbstract\n\nThe use of context is critical for scene understanding in computer vision, where\nthe recognition of an object is driven by both local appearance and the object\u2019s re-\nlationship to other elements of the scene (context). Most current approaches rely\non modeling the relationships between object categories as a source of context.\nIn this paper we seek to move beyond categories to provide a richer appearance-\nbased model of context. We present an exemplar-based model of objects and their\nrelationships, the Visual Memex, that encodes both local appearance and 2D spatial\ncontext between object instances. We evaluate our model on Torralba\u2019s proposed\nContext Challenge against a baseline category-based system. Our experiments\nsuggest that moving beyond categories for context modeling appears to be quite\nbene\ufb01cial, and may be the critical missing ingredient in scene understanding sys-\ntems.\n\n1 Introduction\n\nImage understanding is one of the Holy Grail problems in computer vision. Understanding a scene\narguably requires parsing the image into its constituent objects. In real scenes composed of many\ndifferent objects, the spatial con\ufb01guration of one object can facilitate recognition of related ob-\njects [1], and quite often ambiguities in recognition cannot be resolved without looking beyond the\nspatial extent of the object in question. Thus, algorithms which jointly recognize many objects at\nonce by taking account of contextual relationships have been quite popular. While early systems\nrelied on hand-coded rules for inter-object context (e.g. [2, 3]), more modern approaches typically\nperform inference in a probabilistic graphical model with respect to categories where object interac-\ntions are modeled as higher order potentials [4, 5, 6, 7, 8, 9, 10]. One important implicit assumption\nmade by all such models is that interactions between object instances can be adequately modeled as\nrelationships between human-de\ufb01ned object categories.\nIn this paper we challenge this \u201ccategory assumption\u201d for object-object interactions and propose\na novel category-free approach for modeling object relationships. We propose a new framework,\nthe Visual Memex Model, for representing and reasoning about object identities and their contex-\ntual relationships in an exemplar-based, non-parametric way. We evaluate our model on Antonio\nTorralba\u2019s proposed Context Challenge [11] against a baseline category-based system.\n\n2 Motivation\n\nThe use of categories (classes) to represent concepts (e.g. visual objects) is so prevalent in computer\nvision and machine learning that most researchers don\u2019t give it a second thought. Faced with a new\ntask, one simply carves up the solution space into classes (e.g. cars, people, buildings), assigns class\nlabels to training examples and applies one of the many popular classi\ufb01ers to arrive at a solution.\n\n1\n\n\fHowever, we believe that it is worthwhile to re-examine the basic assumption behind categorization,\nand especially its role in modeling relationships between objects.\nTheories of categorization date back to the ancient Greeks. Aristotle de\ufb01ned categories as discrete\nentities characterized by a set of properties shared by all their members [12]. His categories are\nmutually exclusive, and every member of a category is equal. This classical view is still the most\nwidely accepted way of reasoning about categories and taxonomies in hard sciences. However, as\npointed out by Wittgenstein, this is almost certainly not the way most of our everyday concepts work\n(e.g. what is the set of properties that de\ufb01ne the concept \u201cgame\u201d and nothing else? [13]). Empirical\nevidence for typicality (e.g. a robin is a more commonly cited example of \u201cbird\u201d than a chicken)\nand multiple category memberships (e.g. chicken is both \u201cbird\u201d and \u201cfood\u201d) further complicate the\nAristotelian view.\nThe ground-breaking work of cognitive psychologist Eleanor Rosch [14] demonstrated that humans\ndo not cut up the world into neat categories de\ufb01ned by shared properties, but instead use similarity\nas the basis of categorization. Her Prototype Theory postulates that an object\u2019s class is determined\nby its similarity to (a set of) prototypes which de\ufb01ne each category, allowing for varying degree of\nmembership. Such Prototype models have been successfully used for object recognition [15, 16].\nGoing even further, Exemplar Theory [17, 18] rejects the need for explicit category representation,\narguing instead that a concept can be implicitly formed via all its observed instances. This allows\nfor a dynamic de\ufb01nition of categories based on data availability and task (e.g. an object can be a\nvehicle, a car, a Volvo, or Bob\u2019s Volvo). A recent operationalization of the exemplar model in the\nvisual domain can be found in [19].\nBut it might not be too productive to concentrate on the various categorization theories without con-\nsidering the \ufb01nal aim \u2013 what do we need categories for? One argument is that categorization is a tool\nto facilitate knowledge transfer. E.g. having been attacked once by a tiger, it\u2019s critically important\nto determine if a newly observed object belongs to the tiger category so as to utilize the information\nfrom the previous encounter. Note that here recognizing the explicit category is unimportant, as\nlong as the two tigers could be associated with each other. Guided by this intuition and evidence\nfrom cognitive neuroscience, Bar [20] outlined the importance of analogies, associations, and pre-\ndiction in the human brain. He argues that the goal of visual perception is not to recognize an object\nin the traditional sense of categorizing it (i.e. asking \u2019what is this?\u2019), but instead linking the input\nwith an analogous representation in memory (i.e. asking \u2019what is this like?\u2019). Once a novel input is\nlinked with analogous representations, associated representations are activated rapidly and predict\nthe representations of what is most likely to occur next.\nThese ideas regarding analogies, associations, and prediction are surprisingly similar to Vannevar\nBush\u2019s 1945 concept of the Memex [21] \u2013 which was seen decades later as pioneering hypertext and\nthe World Wide Web. Concerned with the transmission and accessibility of scienti\ufb01c ideas, Bush\nfaulted the \u201carti\ufb01ciality of systems of indexing\u201d and proposed the Memory Extender (Memex), a\nphysical device which would help \ufb01nd information based on association instead of strict categorical\nindexing. The associative links were to be entered manually by the user and could be of several\ndifferent types. Chains of links would form into longer \u201cassociative trails\u201d creating new narratives\nin the concept space. For Bush \u201cthe process of tying two items together is the important thing.\u201d\nInspired by these diverse ideas that are, nonetheless, all pointing in the same general direction, we\nhave been motivated to try to evaluate them on a concrete problem, to see if they can offer bene\ufb01ts\nover the more traditional classi\ufb01cation framework. One particular area where we feel these ideas\nmight prove very useful is in modeling relationships between objects within an image. Therefore,\nin this paper we propose, in an homage to Bush, the Visual Memex Model, as a \ufb01rst step towards\noperationalizing the direct modeling of associations between visual objects, and compare it with\nmore standard tools for the same task.\n\n3 The Visual Memex Model\nOur starting point is Vannevar Bush\u2019s observation that strict categorical indexing of concepts has\nsevere limitations. Abandoning rigid object categories, we embrace Bush\u2019s and Bar\u2019s belief in the\nprimary role of associations, but unlike Bush, we aim to discover these associations automatically\nfrom the data. At the core of our model is an exemplar-based representation of objects [18, 19]. The\nVisual Memex can then be thought of as a vast graph, with nodes representing all the object instances\n\n2\n\n\fFigure 1: The Visual Memex graph encodes object similarity (solid black edge) and spatial context\n(dotted red edge) between pairs of object exemplars. A spatial context feature is stored for each\ncontext edge. The Memex graph can be used to interpret a new image (left) by associating image\nsegments with exemplars in the graph (orange edges) and propagating the information. Figure best\nviewed in color.\n\nin the dataset, and arcs representing the different types of associations between them (Figure 1).\nThere are two types of arcs in our model, encoding two different relationships between objects: 1)\nvisual similarity (e.g. this car looks like that car), and 2) contextual associations (e.g. this car is next\nto this building).\nOnce the graph is built, it can be used to interpret a novel image (Figure 1, left) by \ufb01rst connecting\nsegments within the image with similar stored exemplars, and then propagating contextual informa-\ntion between these exemplars through the graph. When an exemplar gets activated, visually similar\nexemplars as well as other contextually relevant objects get activated as well. This way, exemplar-\nto-exemplar similarity in the Memex graph can serve as Bush\u2019s \u201ctrails\u201d to link concepts together\nin a non-parametric, query-dependent way, without the use of prede\ufb01ned categories. For exam-\nple, in Figure 1, we should be able to infer that a car seen from the rear often co-occurs with an\noblique building wall (but not a frontal wall) \u2013 something which category-based models would be\nhard-pressed to achieve.\nFormally, we de\ufb01ne the Visual Memex Model as a graph G = (V, ES, EC,{D},{f}) consisting\nof N object exemplar nodes V , similarity edges ES, context edges EC, N per-exemplar similarity\nfunctions {D}, and the spatial features {f} associated with each context edge. We now describe\nhow to learn the similarity functions {D} from data to create the structure of the Visual Memex.\n\n3.1 Similarity Edges\n\nWe use the per-exemplar distance-function learning algorithm of Malisiewicz et al [19] to learn the\nobject similarity edges. For each exemplar, the algorithm learns which other exemplars it is similar\nto as well as a distance function. A distance function is a linear combination of elementary distances\nused to measure similarity to the exemplar. We use the same 14 color, shape, texture, and location\nfeatures as used in [19]. For the j-th exemplar, wj is the vector of 14 weights, bj is a scalar bias,\nand \u03b1j \u2208 {0, 1}|C| is a binary indicator vector which encodes which other exemplars the current\nexemplar is similar to. We solve [w\u2217\nj ] = arg minw,b,\u03b1fj(w, b, \u03b1), but since the exemplars\u2019\noptimization problems are independent we drop the j suf\ufb01x for clarity. Let di be the vector of 14\nEuclidean distances between the exemplar whose similarity we are learning (the focal exemplar)\nand the i-th exemplar. C is the set of exemplars that have the same label as the focal exemplar. Let\nL(x) = max(1 \u2212 x, 0)2 be the hinge-squared loss function. A different w, b, and \u03b1 are learned\nper-exemplar by optimizing the following functional:\n\nj , \u03b1\u2217\n\nj , b\u2217\n\nL(wT di + b) \u2212 \u03c3|| \u03b1||2\n\n(1)\n\n||w||2 +X\n\n\u03b1i L(\u2212(wT di + b)) +X\n\nf(w, b, \u03b1) = \u03bb\n2\n\ni\u2208C\n\ni /\u2208C\n\nWe minimize the above SVM-like objective function via an alternating optimization strategy as in\n[19]. The algorithm uses labels (see Section 3.3) during learning where the regularization term favors\nconnecting to many similarly-labeled exemplars and the loss term favors separability in distance\n\n3\n\n\fFigure 2: Torralba\u2019s Context Challenge: \u201cHow far can you go without running a local object de-\ntector?\u201d The task is to reason about the identity of the hidden object (denoted by a \u201c?\u201d) without\nlocal information. In our category-free Visual Memex model, object predictions are generated in the\nform of exemplar associations for the hidden object. In a category-based model, the category of the\nhidden object is directly estimated.\n\nspace. We create a similarity edge between two exemplars if they are deemed similar by each\nothers\u2019 distance functions. We use a \ufb01xed \u03bb = .00001 and \u03c3 = 100 for all exemplars.\n\n3.2 Context Edges\n\nWhen two objects occur inside a single image, we encode their 2-D spatial relationship into a con-\ntext feature vector f \u2208 <10 (visualized as red dotted edges in Figure 1). The context feature vector\nencodes relative overlap, relative displacement, relative scale, and relative height of the bottom-most\npixel between two exemplar regions in a single image. This feature captures the spatial relationship\nbetween two regions and does not take into account any appearance information \u2013 it is a general-\nization of the spatial features used in [8]. We measure the similarity between two context features\nusing a Gaussian kernel: K(f , f0) = e\u2212\u03b11|| f \u2212 f0 ||2 with \u03b11 = 1.0.\n\n3.3 Building the Visual Memex\n\nWe extract a large database of exemplar objects and their ground-truth segmentation masks from\nthe LabelMe [22] dataset and learn the structure of the Visual Memex in an of\ufb02ine setting. We use\nobjects from the 30 most frequently occurring categories in LabelMe. Similarity edges are created\nusing the per-exemplar distance function learning framework of [19], and context edges are created\neach time two exemplars are observed in the same image. We have a total of N = 87, 802 exemplars\nin the Visual Memex, |ES| = 276, 782 similarity edges, and |EC| = 989, 106 context edges.\n\n4 Evaluating on the Context Challenge\nThe intuition that we would like to evaluate is that many useful regularities of the visual world are\nlost when dealing solely with categories (e.g. the side view of a building should associate more with\na side view of a car than a frontal view of a car). The key motivation behind the Visual Memex is\nthat context should depend on the appearance of an object and not just the category it belongs to. In\norder to test this hypothesis against the commonly held practice of abstracting away appearance into\ncategories, we need a rich evaluation dataset as well as a meaningful evaluation task.\nWe found that the Context Challenge [11] recently proposed by Antonio Torralba \ufb01ts our needs\nperfectly. The evaluation task is inspired by the question: \u201cHow far can you go without running an\nobject detector?\u201d The goal is to recognize a single object in the image without peeking at pixels\nbelonging to that object. Torralba presented an algorithm for predicting the category and scale of\nan object using only contextual information [23], but his notion of context is scene-centered (where\nthe appearance of the entire image is used for prediction). Since the context we wish study in this\npaper is object-centered, we use an object-centered formulation of the Context Challenge. While\nit is not clear if the absolute performance numbers on the Context Challenge are very meaningful\nin themselves, we feel that it is an ideal task for studying object-centered context and the role of\ncategorization assumptions in such models.\n\n4\n\nassociations?cardoorroadwindowwindowtreecartreewindowwheelwheel?CategoryEstimationcarpersonwheelroadwindowfencebuildingsidewalkdoortree\fIn our variant of the Context Challenge, the goal is to predict the category of a hidden object yi solely\nbased on its spatial relationships to some provided objects \u2013 without using the pixels belonging to\nthe hidden object at all. For our study, we use manually provided regions and category labels of K\nsupporting objects inside a single image. We refer to the identities of the K supporting objects in the\nimage as {y1, . . . , yK} (where y \u2208 {1, . . . ,|C|}) and the set of K 2D spatial relationship features\nbetween each supporting object and the hidden object as {f i1, . . . , f iK}.\n\n4.1\n\nInference in the Visual Memex Model\n\nIn this section, we explain how to use the Visual Memex graph (automatically constructed from\ndata) to perform inference for the Context Challenge hidden-object prediction task. Not making the\n\u201ccategory assumption,\u201d the model is de\ufb01ned with respect to exemplar associations for the hidden\nobject. Inference in the model returns a compatibility score between every exemplar and the hidden\nobject, and can be though of as returning an ordered list of exemplar associations. Due to the nature\nof exemplar associations as opposed to category assignments, a supporting object can be associated\nwith multiple exemplars as opposed to a single category. We create soft exemplar associations\nbetween each of the supporting objects and the exemplars in the Visual Memex using the similarity\nfunctions {D} (see Section 3.1).\n{S1, . . . , SK} are the appearance features for the K supporting objects. Aa\nj is the af\ufb01nity between\nexemplar a in the Visual Memex and the j-th supporting object and is created by evaluating Sj\nj = e\u2212Da(Sj ). \u03a8(ei, ej, f ij) is the pairwise compatibility between\nunder a\u2019s distance function Aa\nexemplar ei and ej under the spatial feature f ij. Let Wab be the adjacency matrix representation\nof the similarity edges (Wuv = [(u, v) \u2208 ES]). Inference in the Visual Memex Model is done\nby optimizing the following conditional distribution which scores the assignment of an arbitrary\nexemplar ei to the hidden object based on contextual relations:\n\np(ei|A1, . . . , AK, f i1, . . . , f iK) \u221d KY\nP\n\nj=1\n\nlog \u03a8(ei, ej, f ij) =\n\nNX\nP\n\na=1\n(u,v)\u2208EC\n\nAa\n\nj \u03a8(ei, ea, f ij)\n\nWiuWjvK(f ij, f uv)\n\n(u,v)\u2208EC\n\nWiuWjv\n\n(2)\n\n(3)\n\nThe reason for the summation inside Equation 3 is that it aggregates contextual interactions from\nsimilar exemplars. By doing this, we effectively \u201cdensify\u201d the contextual interactions in the Visual\nMemex. An interpretation of this densi\ufb01cation procedure is that we are creating a kernel density\nestimator for an arbitrary pair of exemplars (ei, ej) via a weighted sum of kernels placed at context\nfeatures in the data set {f uv} : (u, v) \u2208 EC where the weights WiuWjv measure visual similarity\nbetween pairs (ei, ej) and (eu, ev) .\nWe experimented with using a single kernel, \u03a8(ei, ej| f ij) = K(f ij, f ei,ej ), and found that the in-\ntegration of multiple features via the densi\ufb01cation described above is a key ingredient for successful\nVisual Memex inference.\nFinally, after performing inference in the Visual Memex Model, we are left with a score for each ex-\nemplar. At this stage, as far as our model is concerned, the recognition has already been performed.\nHowever, since the task we are evaluated on is category-based, we combine the returned exemplars\ninto a vote for categories using Luce\u2019s Axiom of Choice [17] which averages the exemplar responses\nper-category.\n\n4.2 CoLA-based Parametric Model\n\nWe would like to evaluate the Visual Memex model against a more traditional, category-based frame-\nwork with parametric inter-category relationships. One of the most recent and successful approaches\nis the CoLA model [8]. CoLA learns a set of parameters for each pair of categories which correspond\nto relative strengths of the four different top,above,below,inside spatial relationships. In the case of\ndealing with categories directly we consider a conditional distribution over the category of the hid-\nden object yi that factors as a star graph with K leaves (with the hidden object being connected to\n\n5\n\n\fall the supporting objects). \u03b8 are model parameters, \u03a8 is a pairwise potential that measures the com-\npatibility of two categories with a speci\ufb01ed spatial relationship, and Z is a normalization constant\nsuch that the conditional distribution sums to 1.\n\np(yi|y1, . . . , yK, f i1, . . . , f iK, \u03b8) =\n\n1\nZ\n\n\u03a8(yi, yj, f ij, \u03b8)\n\n(4)\n\nKY\n\nj=1\n\nFollowing [8], we use a feature function h(f) that computes the af\ufb01nity between feature f and a set\nof prototypical spatial relationships. We automatically \ufb01nd P prototypical spatial relationships by\nclustering all spatial feature vectors {f} in the training set via the popular Kmeans algorithm. Let\nh(f) \u2208 <P be the normalized vector of af\ufb01nities to cluster centers {c1, . . . , cP}. \u03b8 is the set of\nall parameters in this model, with \u03b8(yi, yj) \u2208 <P being the parameters associated with the pair of\ncategories (yi, yj).\n\nlog \u03a8(yi, yj, f ij, \u03b8) = [h(f ij)T ] \u03b8(yi, yj)\n\nhi(f) \u221d e\u2212\u03b1|| f \u2212ci||2\n\n(5)\n(6)\n\nWe tried using the four prototypical relationships corresponding to above, below, inside, and outside\nas in [8], but found that using Kmeans with signi\ufb01cantly larger number of prototypes P = 30\nproduced superior results. For learning \u03b8, we found the maximum likelihood \u03b8 using gradient\ndescent. The training objective function was optimized to mimic what happens during testing on\nthe Context Challenge task. Since the distributions for the Context Challenge task are de\ufb01ned with\nrespect to a single category variable (see Equation 4), we could compute the partition function\ndirectly and didn\u2019t require any approximations as in [8] (which required training in a loopy graph).\n\n4.3 Reduced KDE Memex Model\n\nSince the Visual Memex Model and the CoLA-inspired model make different assumptions with\nrespect to objects (category-based vs. exemplar-based) and context (parametric vs. nonparametric),\nwe feel it would also be useful to examine a hybrid model \u2013 dubbed the Reduced KDE Memex Model\n\u2013 which uses a nonparametric model of context but operates on object categories. The Reduced\nKDE Memex Model is created by collapsing all exemplars belonging to a single category into fully-\nconnected components which can be thought of as adding categories into the Visual Memex graph.\nIdentities between individual exemplars are lost, and thus we lose the \ufb01ne details of a spatial context.\nBy forming categories, we can no longer say a particular spatial relationship is between a blue side\nview of a car and an oblique brick building, we can only say it is a relationship between a car and\na building. Now that we are left with an unordered bag of spatial relationships {f} between two\ncategories, we need a way to measure compatibility between a newly observed f and the stored\nrelationships.\nWe use the same form of the Context Challenge conditional distribution as in Equation 4. We use\na Kernel Density Estimator(KDE) for every pair of categories, and the potential \u03a8 can be thought\nof as a matrix of such estimators. The use of nonparametric potentials in graphical models has been\nalready explored in the domain of texture analysis [24]. \u03b4ij is the Kronecker delta function.\n\nP\n\nP\n\nlog \u03a8(yi, yj, f ij) =\n\n(u,v)\u2208EC\n\n\u03b4yiyu \u03b4yj yv K(f ij, f uv)\n\n(u,v)\u2208EC\n\n\u03b4yiyu \u03b4yj yv\n\n(7)\n\nThe Reduced Memex model, being category-based and nonparametric, aggregates the spatial re-\nlationships across many different pairs of exemplars from two categories. While we used a \ufb01xed\nkernel K which measures distance isotropically across the dimensions of f, the advantage of such\na nonparametric approach is that with enough data the particularities of K do not matter. We also\nexperimented with a Nearest Neighbor based model, but found the Kernel Density Estimation ap-\nproach to be superior.\n\n6\n\n\fFigure 3: a.) Context Challenge confusion matrices for the 3 methods: Visual Memex, KDE, and\nCoLA. b.) Recognition Precision versus Recall when thresholding output based on con\ufb01dence. c)\nSide by side comparison of the 3 methods\u2019 accuracies for 30 categories.\n5 Results and Discussion\n\nFor the Context Challenge evaluation, we use 200 randomly selected densely labeled images from\nLabelMe [22]. Our testset contains 3048 total objects from 30 different categories. For an image\nwith K objects, we solve K Context Challenge problems with one hidden object and K-1 supporting\nobjects. Qualitative results on this prediction task can be seen in Figure 4.\nWe evaluate the performance of our Visual Memex model, the Reduced Memex KDE model, and the\nCoLA-inspired model with respect to categorization performance (confusion matrices can be seen in\ntop left of Figure 3). The overall recognition accuracy of the Visual Memex Model, Reduced Memex\nModel, and CoLA are .527, .430, and .457 respectively. Note that the Visual Memex Model performs\nsigni\ufb01cantly better than the baselines. Taking a closer look at the per-category accuracies of the three\nmethods (see bottom of Figure 3), we see that the CoLA-based method fails on many categories.\nThe average per-category recognition accuracies of the three methods are: .534, .454, and .213. The\nVisual Memex Model still performs the best, but we see a signi\ufb01cant drop in performance for the\ncategory-based CoLA method. CoLA is biased towards the popular categories, returning the most\nfrequently occurring category (window) quite often. Overall, the Visual Memex Model achieves the\nbest performance for 21 out of the 30 categories.\nIn addition, we plot precision recall curves for each of the three methods to determine if high con\ufb01-\ndence returned by each model is correlated with high recognition rates (top right of Figure 3). The\nVisual Memex model has the most signi\ufb01cant high-precision low-recall regime, suggesting that its\ncon\ufb01dence is a good measure of success. The relatively \ufb02at curve for the CoLA method is related to\nthe problem of overcompensation for popular classes as mentioned above. The distributions returned\nby CoLA tend to degenerate to a single non-zero value (most often on one of the popular categories\nsuch as window). This is why the maximum probability returned by CoLA isn\u2019t a good measure of\ncon\ufb01dence.\nWe also demonstrate the power of the Visual Memex to predict appearance solely based on con-\ntextual interactions with other objects and their visual appearance. The middle row of Figure 4\ndemonstrates some of these associations. Note how in row 1, a plausible viewpoint is selected rather\nthan just a random car. In row 3 we see that the appearance of snow on one mountain suggests\nthat the other portion of the image also contains a snowy mountain. In summary, we presented a\ncategory-free Visual Memex Model and applied it to the task of contextual object recognition within\nthe experimental framework of the Context Challenge. Our experiments con\ufb01rm our intuition that\nmoving beyond categories is bene\ufb01cial for improved modeling of relationships between objects.\nAcknowledgements. This research was in part funded by NSF CAREER award IIS-0546547, NSF\nGraduate Research Fellowship, a Guggenheim Fellowship, as well as generous gift from Google. A.\nEfros thanks the WILLOW team at ENS, Paris for their hospitality.\n\n7\n\nVisual MemexpersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarmpersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarmKDEpersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarmpersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarmCoLApersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarmpersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarm00.10.20.30.40.50.60.70.80.9100.10.20.30.40.50.60.70.80.91RecallPrecisionContext Challenge Prediction Confidence  Visual MemexKDECoLA00.20.40.60.81Context Challenge Recognition Accuracy for 30 Categoriespersoncartreewindowheadbuildingskywallroadsidewalksignchairdoormountaintablefloorstreetlightlampplantpolebalconywheeltextgrasscolumnpanetrashblindgroundarm  Visual MemexKDECoLA\fFigure 4: Qualitative Results on the Context Challenge. Exemplar predictions are from the Visual\nMemex model and categorization results are from the Visual Memex model, the KDE Model, and\nCoLA[8].\n\n8\n\ntable\ufb02oorwalldoor\ufb02oortabledoorwallsidewalkcarroadpersonVisual Memex Exemplar PredictionsInput Image + Hidden RegionCategorization ResultsVisualMemexKDECoLAVisualMemexKDECoLAVisualMemexKDECoLAVisualMemexKDECoLAVisualMemexKDECoLAVisualMemexKDECoLAVisualMemexKDECoLA\fReferences\n[1] Moshe Bar and Shimon Ullman. Spatial context in recognition. Perception, 25:343\u2013352, 1996. 1\n[2] A.R. Hanson and E.M. Riseman. Visions: A computer system for interpreting scenes. Computer Vision\n\nSystems, pages 303\u2013333, 1978. 1\n\n[3] T.M. Strat and M.A. Fischler. Context-based vision: Recognizing objects using information from both\n\n2-d and 3-d imagery. PAMI, 13:1050\u20131065, 1991. 1\n\n[4] Xuming He, Richard S. Zemel, and Miguel \u00b4A. Carreira-Perpi\u02dcn\u00b4an. Multiscale conditional random \ufb01elds\n\nfor image labeling. CVPR, pages 695\u2013702, 2004. 1\n\n[5] Sanjiv Kumar and Martial Hebert. A hierarchical \ufb01eld framework for uni\ufb01ed context-based classi\ufb01cation.\n\nICCV, 2005. 1\n\n[6] Jamie Shotton, John M. Winn, Carsten Rother, and Antonio Criminisi. Textonboost: Joint appearance,\n\nshape and context modeling for multi-class object recognition and segmentation. ECCV, 2006. 1\n\n[7] Andrew Rabinovich, Anrea Vedaldi, Carolina Galleguillos, Eric Wiewora, and Serge Belongie. Objects\n\nin context. ICCV, 2007. 1\n\n[8] Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. Object categorization using co-\n\noccurrence, location and appearance. ECCV, 2008. 1, 4, 5, 6, 8\n\n[9] Devi Parikh, C. Lawrence Zitnick, and Tsuhan Chen. From appearance to context-based recognition:\n\nDense labeling in small images. CVPR, 2008. 1\n\n[10] Bryan C. Russell, Antonio Torralba, Ce Liu, Rob Fergus, and William T. Freeman. Object recognition by\n\nscene alignment. NIPS, 2007. 1\n\n[11] Antonio Torralba. The context challenge. http://web.mit.edu/torralba/www/carsAndFacesInContext.html.\n\n1, 4\n\n[12] Aristotle. Categories. 2\n[13] Ludwig Wittgenstein. Philosophical Investigations. Blackwell Publishing, 1953. 2\n[14] Eleanor Rosch. Principles of categorization. Cognition and Categorization, pages 27\u201348, 1978. 2\n[15] Shimon Edelman. Representation, similarity and the chorus of prototypes. Minds and Machines, 1995. 2\n[16] Ariadna Quattoni, M. Collins, and Trevor Darrell. Transfer learning for image classi\ufb01cation with sparse\n\nprototype representations. CVPR, 2008. 2\n\n[17] D. L. Medin and M.M. Schaffer. Context theory of classi\ufb01cation learning. Psychological Review, 85:207\u2013\n\n238, 1978. 2, 5\n\n[18] Robert M. Nosofsky. Attention, similarity, and the identi\ufb01cation-categorization relationship. Journal of\n\nExperimental Psychology: General, 115(1):39\u201357, 1986. 2\n\n[19] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances.\n\nCVPR, 2008. 2, 3, 4\n\n[20] Moshe Bar. The proactive brain: memory for predictions. Philosophical Transactions of the Royal Society\n\nB, 364:1235\u20131243, 2009. 2\n\n[21] Vannevar Bush. As we may think. The Atlantic Monthly, 1945. 2\n[22] Bryan Russell, Antonio Torralba, Kevin Murphy, , and William T. Freeman. Labelme: a database and\nweb-based tool for image annotation. International Journal of Computer Vision, 77:157\u2013173, 2008. 4, 7\n[23] Antonio Torralba. Contextual priming for object detection. International Journal of Computer Vision,\n\n53:169\u2013191, 2003. 4\n\n[24] Rupert Paget and I. D. Longstaff. Texture synthesis via a noncausal nonparametric multiscale markov\n\nrandom \ufb01eld. IEEE Transactions on Image Processing, 1998. 6\n\n9\n\n\f", "award": [], "sourceid": 86, "authors": [{"given_name": "Tomasz", "family_name": "Malisiewicz", "institution": null}, {"given_name": "Alyosha", "family_name": "Efros", "institution": null}]}