{"title": "Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships", "book": "Advances in Neural Information Processing Systems", "page_first": 1222, "page_last": 1230, "abstract": "The use of context is critical for scene understanding in computer vision, where the recognition of an object is driven by both local appearance and the object\u2019s relationship to other elements of the scene (context). Most current approaches rely on modeling the relationships between object categories as a source of context. In this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their relationships, the Visual Memex, that encodes both local appearance and 2D spatial context between object instances. We evaluate our model on Torralba\u2019s proposed Context Challenge against a baseline category-based system. Our experiments suggest that moving beyond categories for context modeling appears to be quite beneficial, and may be the critical missing ingredient in scene understanding systems.", "full_text": "Beyond Categories: The Visual Memex Model for\n\nReasoning About Object Relationships\n\nTomasz Malisiewicz, Alexei A. Efros\n\nRobotics Institute\n\nCarnegie Mellon University\n\n{tmalisie,efros}@cs.cmu.edu\n\nAbstract\n\nThe use of context is critical for scene understanding in computer vision, where\nthe recognition of an object is driven by both local appearance and the object\u2019s relationship to other elements of the scene (context). Most current approaches rely\non modeling the relationships between object categories as a source of context.\nIn this paper we seek to move beyond categories to provide a richer appearance-based model of context. We present an exemplar-based model of objects and their\nrelationships, the Visual Memex, that encodes both local appearance and 2D spatial\ncontext between object instances. 
We evaluate our model on Torralba\u2019s proposed\nContext Challenge against a baseline category-based system. Our experiments\nsuggest that moving beyond categories for context modeling appears to be quite\nbene\ufb01cial, and may be the critical missing ingredient in scene understanding sys-\ntems.\n\n1 Introduction\n\nImage understanding is one of the Holy Grail problems in computer vision. Understanding a scene\narguably requires parsing the image into its constituent objects. In real scenes composed of many\ndifferent objects, the spatial con\ufb01guration of one object can facilitate recognition of related ob-\njects [1], and quite often ambiguities in recognition cannot be resolved without looking beyond the\nspatial extent of the object in question. Thus, algorithms which jointly recognize many objects at\nonce by taking account of contextual relationships have been quite popular. While early systems\nrelied on hand-coded rules for inter-object context (e.g. [2, 3]), more modern approaches typically\nperform inference in a probabilistic graphical model with respect to categories where object interac-\ntions are modeled as higher order potentials [4, 5, 6, 7, 8, 9, 10]. One important implicit assumption\nmade by all such models is that interactions between object instances can be adequately modeled as\nrelationships between human-de\ufb01ned object categories.\nIn this paper we challenge this \u201ccategory assumption\u201d for object-object interactions and propose\na novel category-free approach for modeling object relationships. We propose a new framework,\nthe Visual Memex Model, for representing and reasoning about object identities and their contex-\ntual relationships in an exemplar-based, non-parametric way. We evaluate our model on Antonio\nTorralba\u2019s proposed Context Challenge [11] against a baseline category-based system.\n\n2 Motivation\n\nThe use of categories (classes) to represent concepts (e.g. 
visual objects) is so prevalent in computer\nvision and machine learning that most researchers don\u2019t give it a second thought. Faced with a new\ntask, one simply carves up the solution space into classes (e.g. cars, people, buildings), assigns class\nlabels to training examples and applies one of the many popular classi\ufb01ers to arrive at a solution.\n\nHowever, we believe that it is worthwhile to re-examine the basic assumption behind categorization,\nand especially its role in modeling relationships between objects.\nTheories of categorization date back to the ancient Greeks. Aristotle de\ufb01ned categories as discrete\nentities characterized by a set of properties shared by all their members [12]. His categories are\nmutually exclusive, and every member of a category is equal. This classical view is still the most\nwidely accepted way of reasoning about categories and taxonomies in hard sciences. However, as\npointed out by Wittgenstein, this is almost certainly not the way most of our everyday concepts work\n(e.g. what is the set of properties that de\ufb01ne the concept \u201cgame\u201d and nothing else? [13]). Empirical\nevidence for typicality (e.g. a robin is a more commonly cited example of \u201cbird\u201d than a chicken)\nand multiple category memberships (e.g. chicken is both \u201cbird\u201d and \u201cfood\u201d) further complicate the\nAristotelian view.\nThe ground-breaking work of cognitive psychologist Eleanor Rosch [14] demonstrated that humans\ndo not cut up the world into neat categories de\ufb01ned by shared properties, but instead use similarity\nas the basis of categorization. Her Prototype Theory postulates that an object\u2019s class is determined\nby its similarity to (a set of) prototypes which de\ufb01ne each category, allowing for varying degrees of\nmembership. 
Such Prototype models have been successfully used for object recognition [15, 16].\nGoing even further, Exemplar Theory [17, 18] rejects the need for explicit category representation,\narguing instead that a concept can be implicitly formed via all its observed instances. This allows\nfor a dynamic de\ufb01nition of categories based on data availability and task (e.g. an object can be a\nvehicle, a car, a Volvo, or Bob\u2019s Volvo). A recent operationalization of the exemplar model in the\nvisual domain can be found in [19].\nBut it might not be too productive to concentrate on the various categorization theories without con-\nsidering the \ufb01nal aim \u2013 what do we need categories for? One argument is that categorization is a tool\nto facilitate knowledge transfer. E.g. having been attacked once by a tiger, it\u2019s critically important\nto determine if a newly observed object belongs to the tiger category so as to utilize the information\nfrom the previous encounter. Note that here recognizing the explicit category is unimportant, as\nlong as the two tigers could be associated with each other. Guided by this intuition and evidence\nfrom cognitive neuroscience, Bar [20] outlined the importance of analogies, associations, and pre-\ndiction in the human brain. He argues that the goal of visual perception is not to recognize an object\nin the traditional sense of categorizing it (i.e. asking \u2019what is this?\u2019), but instead linking the input\nwith an analogous representation in memory (i.e. asking \u2019what is this like?\u2019). Once a novel input is\nlinked with analogous representations, associated representations are activated rapidly and predict\nthe representations of what is most likely to occur next.\nThese ideas regarding analogies, associations, and prediction are surprisingly similar to Vannevar\nBush\u2019s 1945 concept of the Memex [21] \u2013 which was seen decades later as pioneering hypertext and\nthe World Wide Web. 
Concerned with the transmission and accessibility of scienti\ufb01c ideas, Bush\nfaulted the \u201carti\ufb01ciality of systems of indexing\u201d and proposed the Memory Extender (Memex), a\nphysical device which would help \ufb01nd information based on association instead of strict categorical\nindexing. The associative links were to be entered manually by the user and could be of several\ndifferent types. Chains of links would form into longer \u201cassociative trails\u201d creating new narratives\nin the concept space. For Bush \u201cthe process of tying two items together is the important thing.\u201d\nInspired by these diverse ideas that are, nonetheless, all pointing in the same general direction, we\nhave been motivated to try to evaluate them on a concrete problem, to see if they can offer bene\ufb01ts\nover the more traditional classi\ufb01cation framework. One particular area where we feel these ideas\nmight prove very useful is in modeling relationships between objects within an image. Therefore,\nin this paper we propose, in an homage to Bush, the Visual Memex Model, as a \ufb01rst step towards\noperationalizing the direct modeling of associations between visual objects, and compare it with\nmore standard tools for the same task.\n\n3 The Visual Memex Model\nOur starting point is Vannevar Bush\u2019s observation that strict categorical indexing of concepts has\nsevere limitations. Abandoning rigid object categories, we embrace Bush\u2019s and Bar\u2019s belief in the\nprimary role of associations, but unlike Bush, we aim to discover these associations automatically\nfrom the data. At the core of our model is an exemplar-based representation of objects [18, 19]. The\nVisual Memex can then be thought of as a vast graph, with nodes representing all the object instances\n\nFigure 1: The Visual Memex graph encodes object similarity (solid black edge) and spatial context\n(dotted red edge) between pairs of object exemplars. 
A spatial context feature is stored for each\ncontext edge. The Memex graph can be used to interpret a new image (left) by associating image\nsegments with exemplars in the graph (orange edges) and propagating the information. Figure best\nviewed in color.\n\nin the dataset, and arcs representing the different types of associations between them (Figure 1).\nThere are two types of arcs in our model, encoding two different relationships between objects: 1)\nvisual similarity (e.g. this car looks like that car), and 2) contextual associations (e.g. this car is next\nto this building).\nOnce the graph is built, it can be used to interpret a novel image (Figure 1, left) by \ufb01rst connecting\nsegments within the image with similar stored exemplars, and then propagating contextual informa-\ntion between these exemplars through the graph. When an exemplar gets activated, visually similar\nexemplars as well as other contextually relevant objects get activated as well. This way, exemplar-\nto-exemplar similarity in the Memex graph can serve as Bush\u2019s \u201ctrails\u201d to link concepts together\nin a non-parametric, query-dependent way, without the use of prede\ufb01ned categories. For exam-\nple, in Figure 1, we should be able to infer that a car seen from the rear often co-occurs with an\noblique building wall (but not a frontal wall) \u2013 something which category-based models would be\nhard-pressed to achieve.\nFormally, we de\ufb01ne the Visual Memex Model as a graph G = (V, ES, EC,{D},{f}) consisting\nof N object exemplar nodes V , similarity edges ES, context edges EC, N per-exemplar similarity\nfunctions {D}, and the spatial features {f} associated with each context edge. We now describe\nhow to learn the similarity functions {D} from data to create the structure of the Visual Memex.\n\n3.1 Similarity Edges\n\nWe use the per-exemplar distance-function learning algorithm of Malisiewicz et al [19] to learn the\nobject similarity edges. 
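The graph G = (V, ES, EC, {D}, {f}) defined above can be summarized as a small data structure. The following is a minimal illustrative sketch (the class and field names are ours, not the authors' implementation): nodes hold exemplars, similarity edges are stored as pairs, and each context edge stores its spatial feature.

```python
from dataclasses import dataclass, field

@dataclass
class Exemplar:
    exemplar_id: int
    appearance: list  # appearance features for this object instance

@dataclass
class VisualMemex:
    # Sketch of G = (V, E_S, E_C, {D}, {f})
    nodes: dict = field(default_factory=dict)           # V: exemplar id -> Exemplar
    similarity_edges: set = field(default_factory=set)  # E_S: pairs (u, v)
    context_edges: dict = field(default_factory=dict)   # E_C: (u, v) -> spatial feature f
    distance_fns: dict = field(default_factory=dict)    # {D}: exemplar id -> distance function

    def add_similarity_edge(self, u, v):
        # created when u and v are deemed similar by each other's distance functions
        self.similarity_edges.add((u, v))

    def add_context_edge(self, u, v, f):
        # created each time exemplars u and v co-occur in one image
        self.context_edges[(u, v)] = f
```

A dictionary keyed by edge pairs makes the per-edge spatial feature lookup used later during inference a constant-time operation.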
For each exemplar, the algorithm learns which other exemplars it is similar to as well as a distance function. A distance function is a linear combination of elementary distances used to measure similarity to the exemplar. We use the same 14 color, shape, texture, and location features as used in [19]. For the j-th exemplar, w_j is the vector of 14 weights, b_j is a scalar bias, and \u03b1_j \u2208 {0, 1}^{|C|} is a binary indicator vector which encodes which other exemplars the current exemplar is similar to. We solve [w\u2217_j, b\u2217_j, \u03b1\u2217_j] = arg min_{w,b,\u03b1} f_j(w, b, \u03b1), but since the exemplars\u2019 optimization problems are independent we drop the j suffix for clarity. Let d_i be the vector of 14 Euclidean distances between the exemplar whose similarity we are learning (the focal exemplar) and the i-th exemplar. C is the set of exemplars that have the same label as the focal exemplar. Let L(x) = max(1 \u2212 x, 0)^2 be the hinge-squared loss function. A different w, b, and \u03b1 are learned per-exemplar by optimizing the following functional:\n\nf(w, b, \u03b1) = (\u03bb/2) ||w||^2 + \u03a3_{i\u2208C} \u03b1_i L(\u2212(w^T d_i + b)) + \u03a3_{i\u2209C} L(w^T d_i + b) \u2212 \u03c3 ||\u03b1||^2\n\n(1)\n\nWe minimize the above SVM-like objective function via an alternating optimization strategy as in [19]. The algorithm uses labels (see Section 3.3) during learning, where the regularization term favors connecting to many similarly-labeled exemplars and the loss term favors separability in distance space. We create a similarity edge between two exemplars if they are deemed similar by each other\u2019s distance functions. We use a fixed \u03bb = .00001 and \u03c3 = 100 for all exemplars.\n\nFigure 2: Torralba\u2019s Context Challenge: \u201cHow far can you go without running a local object detector?\u201d The task is to reason about the identity of the hidden object (denoted by a \u201c?\u201d) without local information. In our category-free Visual Memex model, object predictions are generated in the form of exemplar associations for the hidden object. In a category-based model, the category of the hidden object is directly estimated.\n\n3.2 Context Edges\n\nWhen two objects occur inside a single image, we encode their 2-D spatial relationship into a context feature vector f \u2208 R^10 (visualized as red dotted edges in Figure 1). The context feature vector encodes relative overlap, relative displacement, relative scale, and relative height of the bottom-most pixel between two exemplar regions in a single image. This feature captures the spatial relationship between two regions and does not take into account any appearance information \u2013 it is a generalization of the spatial features used in [8]. We measure the similarity between two context features using a Gaussian kernel: K(f, f\u2032) = e^{\u2212\u03b1_1 ||f \u2212 f\u2032||^2} with \u03b1_1 = 1.0.\n\n3.3 Building the Visual Memex\n\nWe extract a large database of exemplar objects and their ground-truth segmentation masks from the LabelMe [22] dataset and learn the structure of the Visual Memex in an offline setting. We use objects from the 30 most frequently occurring categories in LabelMe. Similarity edges are created using the per-exemplar distance function learning framework of [19], and context edges are created each time two exemplars are observed in the same image. We have a total of N = 87,802 exemplars in the Visual Memex, |ES| = 276,782 similarity edges, and |EC| = 989,106 context edges.\n\n4 Evaluating on the Context Challenge\nThe intuition that we would like to evaluate is that many useful regularities of the visual world are lost when dealing solely with categories (e.g. the side view of a building should associate more with a side view of a car than a frontal view of a car). 
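The Gaussian kernel over context features from Section 3.2 can be sketched directly from its definition, K(f, f\u2032) = e^{\u2212\u03b1_1 ||f \u2212 f\u2032||^2} with \u03b1_1 = 1.0 (the function and constant names below are ours):

```python
import math

ALPHA_1 = 1.0  # bandwidth used for context-edge similarity (Section 3.2)

def context_kernel(f, f_prime, alpha=ALPHA_1):
    # K(f, f') = exp(-alpha * ||f - f'||^2) between two 10-D
    # spatial context feature vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(f, f_prime))
    return math.exp(-alpha * sq_dist)
```

The kernel is 1 for identical spatial configurations and decays smoothly toward 0 as the two configurations diverge, which is what lets nearby stored context edges vote softly during inference.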
The key motivation behind the Visual Memex is\nthat context should depend on the appearance of an object and not just the category it belongs to. In\norder to test this hypothesis against the commonly held practice of abstracting away appearance into\ncategories, we need a rich evaluation dataset as well as a meaningful evaluation task.\nWe found that the Context Challenge [11] recently proposed by Antonio Torralba \ufb01ts our needs\nperfectly. The evaluation task is inspired by the question: \u201cHow far can you go without running an\nobject detector?\u201d The goal is to recognize a single object in the image without peeking at pixels\nbelonging to that object. Torralba presented an algorithm for predicting the category and scale of\nan object using only contextual information [23], but his notion of context is scene-centered (where\nthe appearance of the entire image is used for prediction). Since the context we wish to study in this\npaper is object-centered, we use an object-centered formulation of the Context Challenge. While\nit is not clear if the absolute performance numbers on the Context Challenge are very meaningful\nin themselves, we feel that it is an ideal task for studying object-centered context and the role of\ncategorization assumptions in such models.\n\nIn our variant of the Context Challenge, the goal is to predict the category of a hidden object yi solely\nbased on its spatial relationships to some provided objects \u2013 without using the pixels belonging to\nthe hidden object at all. For our study, we use manually provided regions and category labels of K\nsupporting objects inside a single image. We refer to the identities of the K supporting objects in the\nimage as {y1, . . . , yK} (where y \u2208 {1, . . . 
, |C|}) and the set of K 2D spatial relationship features between each supporting object and the hidden object as {f_i1, . . . , f_iK}.\n\n4.1 Inference in the Visual Memex Model\n\nIn this section, we explain how to use the Visual Memex graph (automatically constructed from data) to perform inference for the Context Challenge hidden-object prediction task. Not making the \u201ccategory assumption,\u201d the model is defined with respect to exemplar associations for the hidden object. Inference in the model returns a compatibility score between every exemplar and the hidden object, and can be thought of as returning an ordered list of exemplar associations. Due to the nature of exemplar associations as opposed to category assignments, a supporting object can be associated with multiple exemplars as opposed to a single category. We create soft exemplar associations between each of the supporting objects and the exemplars in the Visual Memex using the similarity functions {D} (see Section 3.1).\n{S_1, . . . , S_K} are the appearance features for the K supporting objects. A^a_j is the affinity between exemplar a in the Visual Memex and the j-th supporting object, and is created by evaluating S_j under a\u2019s distance function: A^a_j = e^{\u2212D_a(S_j)}. \u03a8(e_i, e_j, f_ij) is the pairwise compatibility between exemplars e_i and e_j under the spatial feature f_ij. Let W_ab be the adjacency matrix representation of the similarity edges (W_uv = [(u, v) \u2208 ES]). Inference in the Visual Memex Model is done by optimizing the following conditional distribution which scores the assignment of an arbitrary exemplar e_i to the hidden object based on contextual relations:\n\np(e_i | A_1, . . . , A_K, f_i1, . . . , f_iK) \u221d \u220f_{j=1}^{K} \u03a3_{a=1}^{N} A^a_j \u03a8(e_i, e_a, f_ij)\n\n(2)\n\nlog \u03a8(e_i, e_j, f_ij) = [\u03a3_{(u,v)\u2208EC} W_iu W_jv K(f_ij, f_uv)] / [\u03a3_{(u,v)\u2208EC} W_iu W_jv]\n\n(3)\n\nThe reason for the summation inside Equation 3 is that it aggregates contextual interactions from similar exemplars. By doing this, we effectively \u201cdensify\u201d the contextual interactions in the Visual Memex. An interpretation of this densification procedure is that we are creating a kernel density estimator for an arbitrary pair of exemplars (e_i, e_j) via a weighted sum of kernels placed at context features in the data set {f_uv : (u, v) \u2208 EC}, where the weights W_iu W_jv measure visual similarity between pairs (e_i, e_j) and (e_u, e_v).\nWe experimented with using a single kernel, \u03a8(e_i, e_j, f_ij) = K(f_ij, f_{e_i,e_j}), and found that the integration of multiple features via the densification described above is a key ingredient for successful Visual Memex inference.\nFinally, after performing inference in the Visual Memex Model, we are left with a score for each exemplar. At this stage, as far as our model is concerned, the recognition has already been performed. However, since the task we are evaluated on is category-based, we combine the returned exemplars into a vote for categories using Luce\u2019s Axiom of Choice [17], which averages the exemplar responses per-category.\n\n4.2 CoLA-based Parametric Model\n\nWe would like to evaluate the Visual Memex model against a more traditional, category-based framework with parametric inter-category relationships. One of the most recent and successful approaches is the CoLA model [8]. CoLA learns a set of parameters for each pair of categories which correspond to relative strengths of the four different top, above, below, inside spatial relationships. 
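The Visual Memex inference of Equations 2 and 3 reduces to a kernel-density score per candidate exemplar. The sketch below is a simplified, illustrative implementation (all names are ours: W is a sparse map of similarity-edge indicator products W_iu, context_edges maps exemplar pairs (u, v) to their stored spatial features f_uv, and affinities holds the soft associations A^a_j for each supporting object):

```python
import math

def context_kernel(f, g, alpha=1.0):
    # Gaussian kernel over spatial context features (Section 3.2).
    return math.exp(-alpha * sum((a - b) ** 2 for a, b in zip(f, g)))

def log_psi(i, j, f_ij, context_edges, W):
    # Eq. (3): kernel density estimate over stored context features f_uv,
    # weighted by similarity-edge indicators W_iu * W_jv.
    num = den = 0.0
    for (u, v), f_uv in context_edges.items():
        w = W.get((i, u), 0.0) * W.get((j, v), 0.0)
        num += w * context_kernel(f_ij, f_uv)
        den += w
    return math.log(num / den) if den > 0 else float('-inf')

def score_exemplar(i, affinities, spatial_feats, context_edges, W, n_exemplars):
    # Eq. (2): log of the product over supporting objects j of
    # sum_a A_j[a] * Psi(e_i, e_a, f_ij).
    log_score = 0.0
    for A_j, f_ij in zip(affinities, spatial_feats):
        s = sum(A_j[a] * math.exp(log_psi(i, a, f_ij, context_edges, W))
                for a in range(n_exemplars))
        log_score += math.log(s) if s > 0 else float('-inf')
    return log_score
```

Ranking all exemplars by this score yields the ordered list of exemplar associations described in Section 4.1; a candidate with no similarity-edge support receives a score of negative infinity.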
In the case of dealing with categories directly, we consider a conditional distribution over the category of the hidden object y_i that factors as a star graph with K leaves (with the hidden object being connected to all the supporting objects). \u03b8 are model parameters, \u03a8 is a pairwise potential that measures the compatibility of two categories with a specified spatial relationship, and Z is a normalization constant such that the conditional distribution sums to 1.\n\np(y_i | y_1, . . . , y_K, f_i1, . . . , f_iK, \u03b8) = (1/Z) \u220f_{j=1}^{K} \u03a8(y_i, y_j, f_ij, \u03b8)\n\n(4)\n\nFollowing [8], we use a feature function h(f) that computes the affinity between feature f and a set of prototypical spatial relationships. We automatically find P prototypical spatial relationships by clustering all spatial feature vectors {f} in the training set via the popular K-means algorithm. Let h(f) \u2208