{"title": "Learning models of object structure", "book": "Advances in Neural Information Processing Systems", "page_first": 1615, "page_last": 1623, "abstract": "We present an approach for learning stochastic geometric models of object categories from single view images. We focus here on models expressible as a spatially contiguous assemblage of blocks. Model topologies are learned across groups of images, and one or more such topologies is linked to an object category (e.g.  chairs). Fitting learned topologies to an image can be used to identify the object class, as well as detail its geometry. The latter goes beyond labeling objects, as it provides the geometric structure of particular instances.  We learn the models using joint statistical inference over structure parameters, camera parameters, and instance parameters. These produce an image likelihood through a statistical imaging model. We use trans-dimensional sampling to explore topology hypotheses, and alternate between Metropolis-Hastings and stochastic dynamics to explore instance parameters. Experiments on images of furniture objects such as tables and chairs suggest that this is an effective approach for learning models that encode simple representations of category geometry and the statistics thereof, and support inferring both category and geometry on held out single view images.", "full_text": "Learning models of object structure\n\nJoseph Schlecht\n\nKobus Barnard\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nUniversity of Arizona\n\nUniversity of Arizona\n\nschlecht@cs.arizona.edu\n\nkobus@cs.arizona.edu\n\nAbstract\n\nWe present an approach for learning stochastic geometric models of object cat-\negories from single view images. We focus here on models expressible as a\nspatially contiguous assemblage of blocks. Model topologies are learned across\ngroups of images, and one or more such topologies is linked to an object cate-\ngory (e.g. chairs). 
Fitting learned topologies to an image can be used to identify\nthe object class, as well as detail its geometry. The latter goes beyond labeling\nobjects, as it provides the geometric structure of particular instances. We learn\nthe models using joint statistical inference over category parameters, camera pa-\nrameters, and instance parameters. These produce an image likelihood through a\nstatistical imaging model. We use trans-dimensional sampling to explore topology\nhypotheses, and alternate between Metropolis-Hastings and stochastic dynamics\nto explore instance parameters. Experiments on images of furniture objects such\nas tables and chairs suggest that this is an effective approach for learning models\nthat encode simple representations of category geometry and the statistics thereof,\nand support inferring both category and geometry on held out single view images.\n\n1\n\nIntroduction\n\nIn this paper we develop an approach to learn stochastic 3D geometric models of object categories\nfrom single view images. Exploiting such models for object recognition systems enables going\nbeyond simple labeling. In particular, \ufb01tting such models opens up opportunities to reason about\nfunction or utility, how the particular object integrates into the scene (i.e., perhaps it is an obsta-\ncle), how the form of the particular instance is related to others in its category (i.e., perhaps it is a\nparticularly tall and narrow one), and how categories themselves are related.\n\nCapturing the wide variation in both topology and geometry within object categories, and \ufb01nding\ngood estimates for the underlying statistics, suggests a large scale learning approach. We propose\nexploiting the growing number of labeled single-view images to learn such models. While our\napproach is trivially extendable to exploit multiple views of the same object, large quantities of such\ndata is rare. Further, the key issue is to learn about the variation of the category. 
Put differently,\nif we are limited to 100 images, we would prefer to have 100 images of different examples, rather\nthan, say, 10 views of 10 examples.\n\nRepresenting, learning, and using object statistical geometric properties is potentially simpler in the\ncontext of 3D models. In contrast, statistical models that encode image-based appearance characteristics\nand/or part configuration statistics must deal with confounds due to the imaging process. For\nexample, right angles in 3D can have a wide variety of angles in the image plane, leading to using\nthe same representations for both structure variation and pose variation. This means that the represented\ngeometry is less specific and less informative. By contrast, encoding the structure variation\nin 3D models is simpler and more informative because they are linked to the object alone.\n\nTo deal with the effect of an unknown camera, we estimate the camera parameters simultaneously\nwhile fitting the model hypothesis. A 3D model hypothesis is a relatively strong hint as to what\nthe camera might be. Further, we make the observation that the variations due to standard camera\nprojection are quite unlike typical category variation. Hence, in the context of a given object model\nhypothesis, the fact that the camera is not known is not a significant impediment, and much can be\nestimated about the camera under that hypothesis.\n\nWe develop our approach with object models that are expressible as a spatially contiguous assemblage\nof blocks. We include in the model a constraint on right angles between blocks. We further\nsimplify matters by considering images where there are minimal distracting features in the background.\nWe experiment with images from five categories of furniture objects. Within this domain,\nwe are able to automatically learn topologies. The models can then be used to identify the object\ncategory using statistical inference. 
Recognition of objects in clutter is likely effective with this ap-\nproach, but we have yet to integrate support for occlusion of object parts into our inference process.\n\nWe learn the parameters of each category model using Bayesian inference over multiple image\nexamples for the category. Thus we have a number of parameters specifying the category topology\nthat apply to all images of objects from the category. Further, as a side effect, the inference process\n\ufb01nds instance parameters that apply speci\ufb01cally to each object. For example, all tables have legs and\na top, but the proportions of the parts differ among the instances. In addition, the camera parameters\nfor each image are determined, as these are simultaneously \ufb01t with the object models. The object\nand camera hypotheses are combined with an imaging model to provide the image likelihood that\ndrives the inference process.\n\nFor learning we need to \ufb01nd parameters that give a high likelihood of the data from multiple ex-\namples. Because we are searching for model topologies, we need to search among models with\nvarying dimension. For this we use the trans-dimensional sampling framework [7, 8]. We explore\nthe posterior space within a given probability space of a particular dimension by combining standard\nMetropolis-Hastings [1, 14], with stochastic dynamics [18]. As developed further below, these two\nmethods have complementary strengths for our problem. Importantly, we arrange the sampling so\nthat the hybrid of samplers are guaranteed to converge to the posterior distribution. This ensures that\nthe space will be completely explored, given enough time.\n\nRelated work. Most work on learning representations for object categories has focused on image-\nbased appearance characteristics and/or part con\ufb01guration statistics (e.g., [4, 5, 6, 12, 13, 24]).\nThese approaches typically rely on effective descriptors that are somewhat resilient to pose\nchange (e.g., [16]). 
A second force favoring learning 2D representations is the explosion of readily\navailable images compared with that for 3D structure, and thus treating category learning as\nstatistical pattern recognition is more convenient in the data domain (2D images). However, some\nresearchers have started imposing more projective geometry into the spatial models. For example,\nSavarese and Fei-Fei [19, 20] build a model where arranged parts are linked by a fundamental matrix.\nTheir training process is helped by multiple examples of the same objects, but notably they\nare able to use training data with clutter. Their approach is different from ours in that models are\nbuilt more bottom up, and this process is somewhat reliant on the presence of surface textures. A\ndifferent strategy proposed by Hoiem et al. [9] is to fit a deformable 3D blob to cars, driven largely\nby appearance cues mapped onto the model. Our work also relates to recent efforts in learning abstract\ntopologies [11, 26] and structure models for 2D images of objects constrained by grammar\nrepresentations [29, 30]. Also relevant is a large body of older work on representing objects with\n3D parts [2, 3, 28] and detecting objects in images given a precise 3D model [10, 15, 25], such\nas one for machined parts in an industrial setting. Finally, we have also been inspired by work\non fitting deformable models of known topology to 2D images in the case of human pose estimation\n(e.g., [17, 22, 23]).\n\n2 Modeling object category structure\n\nWe use a generative model for image features corresponding to examples from object categories\n(Fig. 1). A category is associated with a sampling from category level parameters which are the\nnumber of parts, n, their interconnections (topology), t, the structure statistics rs, and the camera\nstatistics, rc. 
Associating camera distributional parameters with a category allows us to exploit\nregularity in how different objects are photographed during learning. We support clusters within\ncategories to model multiple structural possibilities (e.g., chairs with and without arm rests). The\ncluster variable, z, selects a category topology and structure distributional parameters for attachment\nlocations and part sizes. We denote the specific values for a particular example by s. Similarly, we\ndenote the camera capturing it by c. The projected model image then generates image features,\nx, for which we use edge points and surface pixels. In summary, the parameters for an image are\n\u03b8(n) = (c, s, t, rc, rs, n).\n\nFigure 1: Graphical model for the generative approach\nto images of objects from categories described by\nstochastic geometric models. The category level parameters\nare the number of parts, n, their interconnections\n(topology), t, the structure statistics rs, and the camera\nstatistics, rc. Hyperparameters for category level parameters\nare omitted for clarity. A sample of category\nlevel parameters provides a statistical model for a given\ncategory, which is then sampled for the camera and object\nstructure values cd and sd, optionally selected from\na cluster within the category by zd. cd and sd yield a\ndistribution over image features xd.\n\nGiven a set of D images containing examples of an object category, our goal is to learn the model\n\u0398(n) generating them from detected feature sets X = x1, . . . , xD. In addition to category-level\nparameters shared across instances, which are of most interest, \u0398(n) comprises camera models C =\nc1, . . . , cD and structure part parameters S = s1, . . . , sD assuming a hard cluster assignment. 
In\nother words, the camera and the geometry of the training examples are fit collaterally.\n\nWe separate the joint density into a likelihood and prior\n\n$$p\\left(X, \\Theta^{(n)}\\right) = p^{(n)}(X, C, S \\mid t, r_c, r_s) \\, p^{(n)}(t, r_c, r_s, n) , \\tag{1}$$\n\nwhere we use the notation p(n)(\u00b7) for a density function corresponding to n parts. Conditioned on\nthe category parameters, we assume that the D sets of image features and instance parameters are\nindependent, giving\n\n$$p^{(n)}(X, C, S \\mid t, r_c, r_s) = \\prod_{d=1}^{D} p^{(n)}(x_d, c_d, s_d \\mid t, r_c, r_s) . \\tag{2}$$\n\nThe feature data and structure parameters are generated by a sub-category cluster with weights and\ndistributions defined by rs = (\u03c0, \u00b5s, \u03a3s). As previously mentioned, the camera is shared across\nclusters, and drawn from a distribution defined by rc = (\u00b5c, \u03a3c). We formalize the likelihood of\nan object, camera, and image features under M clusters as\n\n$$p^{(n)}(x_d, c_d, s_d \\mid t, r_c, r_s) = \\sum_{m=1}^{M} \\pi_m \\, \\underbrace{p^{(n_m)}(x_d \\mid c_d, s_{md})}_{\\text{Image}} \\, \\underbrace{p(c_d \\mid \\mu_c, \\Sigma_c)}_{\\text{Camera}} \\, \\underbrace{p^{(n_m)}(s_{md} \\mid t_m, \\mu_{sm}, \\Sigma_{sm})}_{\\text{Object}} . \\tag{3}$$\n\nWe arrive at equation (3) by introducing a binary assignment vector z for each image feature set,\nsuch that zm = 1 if the mth cluster generated it and 0 otherwise. The cluster weights are then given\nby \u03c0m = p(zm = 1).\n\nFor the prior probability distribution, we assume category parameter independence, with the clustered\ntopologies conditionally independent given the number of parts. The prior in (1) becomes\n\n$$p^{(n)}(t, r_c, r_s, n) = p(r_c) \\prod_{m=1}^{M} p^{(n_m)}(t_m \\mid n_m) \\, p^{(n_m)}(r_{sm}) \\, p(n_m) . \\tag{4}$$\n\nFor category parameters in the camera and structure models, rc and rs, we use Gaussian statistics\nwith weak Gamma priors that are empirically chosen. We set the number of parts in the object sub-categories,\nn, to be geometrically distributed. 
We set the prior over edges in the topology given n to\nbe uniform.\n\n2.1 Object model\n\nWe model object structure as a set of connected three-dimensional block constructs representing\nobject parts. We account for symmetric structure in an object category, e.g., legs of a table or chair,\nby introducing compound block constructs. We define two constructs for symmetrically aligned\npairs (2) or quartets (4) of blocks. Unless otherwise specified, we will use blocks to specify both\nsimple blocks and compound blocks as they are handled similarly.\n\nFigure 2: The camera model is constrained to reduce the ambiguity introduced\nin learning from a single view of an object. We position the camera at\na fixed distance and direct its focus at the origin; rotation is allowed about the\nx-axis. Since the object model is allowed to move about the scene and rotate,\nthis model is capable of capturing most images of a scene.\n\nThe connections between blocks are made at a point on adjacent, parallel faces. We consider the\norganization of these connections as a graph defining the structural topology of an object category,\nwhere the nodes in the graph represent structural parts and the edges give the connections. We use\ndirected edges, inducing attachment dependence among parts.\n\nEach block has three internal parameters representing its width, height, and length. Blocks representing\nsymmetric pairs or quartets have one or two extra parameters defining the relative positioning\nof the sub-blocks. Blocks potentially have two external attachment parameters u, v where one other\nblock is connected. We further constrain blocks to attach to at most one other block, giving a directed tree\nfor the topology and enabling conditional independence among attachments. 
Note that blocks can\nbe visually \u201cattached\u201d to additional blocks that they abut, but representing them as true attachments\nmakes the model more complex and is not necessary. Intuitively, the model is much like physically\nbuilding a piece of furniture block by block, but saving on glue by only connecting an added block\nto one other block. Despite its simplicity, this model can approximate a surprising range of man-made\nobjects.\n\nFor a set of n connected blocks of the form b = (w, h, l, u1, v1, . . .), the structure model is\ns = (\u03d5, po, b1, . . . , bn). We position the connected blocks in an object coordinate system defined\nby a point po \u2208 R3 on one of the blocks and a y-axis rotation angle, \u03d5, about this position. Since\nwe constrain the blocks to be connected at right angles on parallel faces, the position of other blocks\nwithin the object coordinate system is entirely defined by po and the attachment points between\nblocks.\n\nThe object structure instance parameters are assumed Gaussian distributed according to \u00b5s, \u03a3s in\nthe likelihood (3). Since the instance parameters in the object model are conditionally independent\ngiven the category, the covariance matrix is diagonal. Finally, for a block bi attaching to bj on faces\ndefined by the kth size parameter, the topology edge set is defined as $t = \\left( i, j, k : b_i \\xleftarrow{k} b_j \\right)$.\n\n2.2 Camera model\n\nA full specification of the camera and the object position, pose, and scale leads to a redundant set\nof parameters. We choose a minimal set for inference that retains full expressiveness as follows.\nSince we are unable to distinguish the actual size of an object from its distance to the camera, we\nconstrain the camera to be at a fixed distance from the world origin. We reduce potential ambiguity\nfrom objects of interest being variably positioned in R3 by constraining the camera to always look\nat the world origin. 
Because we allow an object to rotate around its vertical axis, we only need to\nspecify the camera zenith angle, \u03d1. Thus we set the horizontal x-coordinate of the camera in the\nworld to zero and allow \u03d1 to be the only variable extrinsic parameter. In other words, the position\nof the camera is constrained to a circular arc on the y, z-plane (Figure 2). We model the amount of\nperspective in the image from the camera by parameterizing its focal length, f. Our camera instance\nparameters are thus c = (\u03d1, f, s), where \u03d1 \u2208 [\u2212\u03c0/2, \u03c0/2], and f, s > 0. The camera instance\nparameters in (3) are modeled as Gaussian with category parameters \u00b5c, \u03a3c.\n\n2.3 Image model\n\nWe represent an image as a collection of detected feature sets that are statistically generated by an\ninstance of our object and camera. We model each image feature set as arising from a corresponding feature\ngenerator that depends on projected object information. For this work we generate edge points from\nprojected object contours and image foreground from colored surface points (Figure 3).\n\n[Figure 3 panel labels: Object Model, Projected Surface, Projected Contours, Fg Detection, Edge Detection, Image Data]\n\nFigure 3: Example of the generative image\nmodel for detected features. The\nleft side of the figure gives a rendering\nof the object and camera models fit to\nthe image on the right side. The rightward\narrows show the process of statistical\ngeneration of image features. The\nleftward arrows are feature detection in\nthe image data.\n\nWe assume that feature responses are conditionally independent given the model and that the G\ndifferent types of features are also independent. Denoting the detected feature sets in the dth image\nby xd = xd1, . . . 
, xdG, we expand the image component of equation (3) to\n\n$$p^{(n_m)}(x_d \\mid c_d, s_{md}, t_m) = \\prod_{g=1}^{G} \\prod_{i=1}^{N_x} f^{(n_m)}_{\\theta g}(x_{dgi}) . \\tag{5}$$\n\nThe function $f^{(n_m)}_{\\theta g}(\\cdot)$ measures the likelihood of a feature generator producing the response of a\ndetector at each pixel using our object and camera models. Effective construction and implementation\nof the edge and surface point generators is intricate, and thus we only briefly summarize them.\nPlease refer to our technical report [21] for more details.\n\nEdge point generator. We model edge point location and orientation as generated from projected\n3D contours of our object model. Since the feature generator likelihood in (5) is computed over all\ndetection responses in an image, we define the edge generator likelihood as\n\n$$\\prod_{i=1}^{N_x} f_\\theta(x_i) = \\prod_{i=1}^{N_x} e_\\theta(x_i)^{E_i} \\cdot e'_\\theta(x_i)^{(1 - E_i)} , \\tag{6}$$\n\nwhere the probability density function e\u03b8(\u00b7) gives the likelihood of a detected edge point at the ith\npixel, and e\u2032\u03b8(\u00b7) is the density for pixel locations not containing an edge point. The indicator Ei is 1\nif the pixel is an edge point and 0 otherwise. This can be approximated by [21]\n\n$$\\prod_{i=1}^{N_x} f_\\theta(x_i) \\approx \\left( \\prod_{i=1}^{N_x} \\tilde{e}_\\theta(x_i)^{E_i} \\right) e_{bg}^{N_{bg}} \\, e_{miss}^{N_{miss}} , \\tag{7}$$\n\nwhere $e_{bg}^{N_{bg}}$ and $e_{miss}^{N_{miss}}$ are the probabilities of background and missing detections, and Nbg and Nmiss\nare the number of background and missing detections. The density $\\tilde{e}_\\theta$ approximates e\u03b8 by estimating\nthe most likely correspondence between observed edge points and model edges.\n\nTo compute the edge point density e\u03b8, we assume correspondence and model the ith edge point, generated\nfrom the jth model point, by a Gaussian distributed displacement dij in the direction perpendicular\nto the projected model contour. 
We further define the gradient direction of the generated\nedge point to have Gaussian error in its angle difference \u03c6ij with the perpendicular direction of the\nprojected contour. If mj is the model point assumed to generate xi, then\n\n$$e_\\theta(x_i) = c_e \\, \\mathcal{N}(d_{ij}; 0, \\sigma_d) \\, \\mathcal{N}(\\phi_{ij}; 0, \\sigma_\\phi) , \\tag{8}$$\n\nwhere the perpendicular distance between xi and mj and angular difference between edge point\ngradient gi and model contour perpendicular vj are defined as $d_{ij} = \\| x_i - m_j \\|$ and $\\phi_{ij} = \\cos^{-1}\\left( \\frac{g_i^T v_j}{\\| g_i \\| \\, \\| v_j \\|} \\right)$. The range of dij is \u2265 0, and the angle \u03c6ij is in [0, 1].\n\nSurface point generator. Surface points are the projected points of viewable surfaces in our object\nmodel. Image foreground pixels are found using k-means clustering on pixel intensities. Setting\nk = 2 works well as our training images were selected to have minimal clutter. Surface point detections\nintersecting with model surface projection lead to four easily identifiable cases: foreground,\nbackground, missing, and noise. Similar to the edge point generator, the surface point generator\nlikelihood expands to\n\n$$\\prod_{i=1}^{N_x} f_\\theta(x_i) = s_{fg}^{N_{fg}} \\, s_{bg}^{N_{bg}} \\, s_{noise}^{N_{noise}} \\, s_{miss}^{N_{miss}} . \\tag{9}$$\n\n3 Learning\n\nTo learn a category model, we sample the posterior, $p\\left(\\Theta^{(n)} \\mid X\\right) \\propto p\\left(X, \\Theta^{(n)}\\right)$, to find good parameters\nshared by images of multiple object examples from the category. Given enough iterations,\na good sampler converges to the target distribution and an optimal value can be readily discovered\nin the process. However, our posterior distribution is highly convoluted with many sharp, narrow\nridges for close fits to the edge points and foreground. 
In our domain, as in many similar problems,\nstandard sampling techniques tend to get trapped in these local extrema for long periods of time.\nOur strategy for inference is to combine a mixture of sampling techniques with different strengths\nin exploring the posterior distribution while still maintaining convergence conditions.\n\nOur sampling space is over all category and instance parameters for a set of input images. We denote\nthe space over an instance of the camera and object models with n parts as C \u00d7 S(n). Let T(n) be\nthe space over all topologies and $R_c^{(n)} \\times R_s^{(n)}$ the space over all category statistics. The complete sampling\nspace with m subcategories and D instances is then defined as\n\n$$\\Omega = \\bigcup_{n \\in \\mathbb{N}^m} C^{D} \\times {S^{(n)}}^{D} \\times T^{(n)} \\times R_c^{(n)} \\times R_s^{(n)} . \\tag{10}$$\n\nOur goal is to sample the posterior with \u0398(n) \u2208 \u2126 such that we find the set of parameters that\nmaximizes it. Since the number of parameters in the sampling space is unknown, some proposals\nmust change the model dimension. In particular, these jump moves (following the terminology of Tu\nand Zhu [27]) arise from changes in topology. Diffusion moves make changes to parameters within\na given topology. We cycle between the two kinds of moves.\n\nDiffusion moves for sampling within topology. We found a multivariate Gaussian with small\ncovariance values on the diagonal to be a good proposal distribution for the instance parameters.\nProposals for block size changes are done in one of two ways: scaling or shifting attached blocks.\nWe found that both are useful for good exploration of the object structure parameter space. 
Category\nparameters were sampled by making proposals from the Gamma priors.\n\nUsing standard Metropolis-Hastings (MH) [1, 14], the proposed moves are accepted with probability\n\n$$\\alpha\\left(\\tilde{\\theta}^{(n)}\\right) = \\min\\left\\{ 1, \\frac{p(\\tilde{\\theta}^{(n)} \\mid X) \\, q(\\theta^{(n)} \\mid \\tilde{\\theta}^{(n)})}{p(\\theta^{(n)} \\mid X) \\, q(\\tilde{\\theta}^{(n)} \\mid \\theta^{(n)})} \\right\\} . \\tag{11}$$\n\nThe MH diffusion moves exhibit a random walk behavior and can take extended periods of time\nwith many rejections to converge and mix well in regions of high probability in the target\ndistribution. Hence we occasionally follow a hybrid Markov chain based on stochastic dynamics,\nwhere our joint density is used in a potential energy function. We use the common leapfrog discretization\n[18] to follow the dynamics and sample from phase space. The necessary derivative\ncalculations are approximated using numerical differentiation (details in [21]).\n\nJump moves for topology changes. For jump moves, we use the trans-dimensional sampling approach\noutlined by Green [7]. For example, in the case of a block birth in the model, we modify the\nstandard MH acceptance probability to\n\n$$\\alpha\\left(\\tilde{\\theta}^{(n+1)}\\right) = \\min\\left\\{ 1, \\frac{p(\\tilde{\\theta}^{(n+1)} \\mid X)}{p(\\theta^{(n)} \\mid X) \\, q(\\tilde{b}, \\tilde{t})} \\, \\frac{r_d}{r_b} \\left| \\frac{\\partial(\\tilde{\\theta}^{(n+1)})}{\\partial(\\theta^{(n)}, \\tilde{b}, \\tilde{t})} \\right| \\right\\} . \\tag{12}$$\n\nThe jump proposal distribution generates a new block and attachment edge in the topology that are\ndirectly used in the proposed object model. Hence, the change of variable factor in the Jacobian\nreduces to 1. The probability of selecting a birth move versus a death move is given by the ratio of\nrd/rb, which we have also defined to be 1. The complementary block death move is similar with the\ninverse ratio of posterior and proposal distributions. 
We additionally define split and merge moves.\nThese are essential moves in our case because the sampler often generates blocks with strong partial\nfits, and a proposal to split such a block is often accepted.\n\n4 Results\n\nWe evaluated our model and its inference with image sets of furniture categories, including tables,\nchairs, sofas, footstools, and desks. We have 30 images in each category containing a single arbitrary\nview of the object instance. The images we selected for our data set have the furniture object\nprominently in the foreground. This enables focusing on evaluating how well we learn 3D structure\nmodels of objects.\n\n                       Actual\nPredicted    Table  Chair  Footstool  Sofa  Desk\nTable           10      5          4     0     2\nChair            5      9         10     5     3\nFootstool        0      0          1     3     1\nSofa             0      1          0     7     3\nDesk             0      0          0     0     6\n\nFigure 4: Generated samples of tables (a) and chairs (b) from the learned structure topology and statistical\ncategory parameters. The table shows the confusion matrix for object category recognition.\n\nInference of the object and camera instances was done on detected edge and surface points in the\nimages. We applied a Canny-based detector for the edges in each image, using the same parameterization\neach time. Thus, the images contain some edge points considered noise or that are missing\nfrom obvious contours. To extract the foreground, we applied a dynamic threshold discovered in\neach image with a k-means algorithm. Since the furniture objects in the images primarily occupy\nthe image foreground, the detection is quite effective.\n\nWe learned the object structure for each category over a 15-image subset of our data for training\npurposes. We initialized each run of the sampler with a random draw of the category and instance\nparameters. This is accomplished by first sampling the prior for the object position, rotation and\ncamera view; initially there are no structural elements in the model. 
We then sample the likelihoods\nfor the instance parameters. The reversible-jump moves in the sampler iteratively propose adding\nand removing object constructs to the model. The mixture of moves in the sampler was 1-to-1 for\njump and diffusion, with a stochastic dynamics chain performed very infrequently. Figure 6 shows\nexamples of learned furniture categories and their instances fit to images after 100K iterations. We\nvisualize the inferred structure topology and statistics in Figure 4 with generated samples from the\nlearned table and chair categories. We observe that the topology of the object structure is quickly\nestablished after roughly 10K iterations; this can be seen in Figure 5, which shows the simultaneous\ninference of two table instances through roughly 10K iterations.\n\nWe tested the recognition ability of the learned models on a held out 15-image subset of our data for\neach category. For each image, we draw a random sample from the category statistics and a topology\nand begin the diffusion sampling process to fit it. The best overall fit according to the joint density\nis declared the predicted category. The confusion matrix shown in Figure 4 shows mixed results.\nOverall, recognition is substantively better than chance (20%), but we expect that much better results\nare possible with our approach. We conclude from the learned models and confusion matrix that the\nchair topology shares much of its structure with the other categories and causes the most mistakes.\nWe continue to experiment with larger training data sets, clustering category structure, and longer\nrun times to get better structure fits in the difficult training examples, each of which could help\nresolve this confusion.\n\nFigure 5: From left to right, successive random samples from 2 of 15 table instances, each after 2K\niterations of model inference. 
The category topology and statistics are learned simultaneously from\nthe set of images; the form of the structure is shared across instances.\n\nFigure 6: Learning the topology of furniture objects. Sets of contiguous blocks were fit across five\nimage data sets. Model fitting is done jointly for the fifteen images of each set. The fits for the\ntraining examples are shown by the blocks drawn in red. Detected edge points are shown in green.\n\nAcknowledgments\n\nThis work is supported in part by NSF CAREER Grant IIS-0747511.\n\nReferences\n\n[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning.\n\nMachine Learning, 50(1):5\u201343, 2003.\n\n[2] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological\n\nReview, 94(2):115\u2013147, April 1987.\n\n[3] M. B. Clowes. On seeing things. Artificial Intelligence, 2(1):79\u2013116, 1971.\n[4] D. Crandall and D. Huttenlocher. Weakly-supervised learning of part-based spatial models for visual\n\nobject recognition. In 9th European Conference on Computer Vision, 2006.\n\n[5] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an\nincremental bayesian approach tested on 101 object categories. In Workshop on Generative-Model Based\nVision, 2004.\n\n[6] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning.\n\nIn IEEE Conference on Computer Vision and Pattern Recognition, 2003.\n\n[7] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination.\n\nBiometrika, 82(4):711\u2013732, 1995.\n\n[8] P. J. Green. Trans-dimensional markov chain monte carlo. In Highly Structured Stochastic Systems. 2003.\n[9] D. Hoiem, C. Rother, and J. Winn. 3d layoutcrf for multi-view object class recognition and segmentation.\n\nIn CVPR, 2007.\n\n[10] D. Huttenlocher and S. Ullman. 
Recognizing solid objects by alignment with an image. IJCV, 5(2):195\u2013\n\n212, 1990.\n\n[11] C. Kemp and J. B. Tenenbaum. The discovery of structural form. Proceedings of the National Academy\n\nof Sciences, 105(31):10687\u201310692, 2008.\n\n[12] A. Kushal, C. Schmid, and J. Ponce. Flexible object models for category-level 3d object recognition. In\n\nCVPR, 2007.\n\n[13] M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond local appearance: Category recognition from\n\npairwise interactions of simple features. In CVPR, 2007.\n\n[14] J. S. Liu. Monte Carlo Strategies in Scienti\ufb01c Computing. Springer-Verlag, 2001.\n[15] D. G. Lowe. Fitting parameterized three-dimensional models to images. IEEE Transactions on Pattern\n\nAnalysis and Machine Intelligence, 13(5):441\u2013450, 1991.\n\n[16] D. G. Lowe. Distinctive image features from scale-invariant keypoint. International Journal of Computer\n\nVision, 60(2):91\u2013110, 2004.\n\n[17] G. Mori and J. Malik. Recovering 3d human body con\ufb01gurations using shape contexts. IEEE Transactions\n\non Pattern Analysis and Machine Intelligence, 2006.\n\n[18] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-\n\nTR-93-1, University of Toronto, 1993.\n\n[19] S. Savarese and L. Fei-Fei. 3d generic object categorization, localization and pose estimation. In IEEE\n\nIntern. Conf. in Computer Vision (ICCV), 2007.\n\n[20] S. Savarese and L. Fei-Fei. View synthesis for recognizing unseen poses of object classes. In European\n\nConference on Computer Vision (ECCV), 2008.\n\n[21] J. Schlecht and K. Barnard. Learning models of object structure. Technical report, University of Arizona,\n\n2009.\n\n[22] C. Sminchisescu. Kinematic jump processes for monocular 3d human tracking. In Computer vision and\n\npattern recognition, 2003.\n\n[23] C. Sminchisescu and B. Triggs. 
Estimating articulated human motion with covariance scaled sampling.\n\nInternational Journal of Robotics Research, 22(6):371\u2013393, 2003.\n\n[24] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Learning hierarchical models of scenes,\n\nobjects, and parts. In ICCV, 2005.\n\n[25] K. Sugihara. A necessary and suf\ufb01cient condition for a picture to represent a polyhedral scene. IEEE\n\nTransactions on Pattern Analysis and Machine Intelligence, 6(5):578\u2013586, September 1984.\n\n[26] J. B. Tenenbaum, T. L. Grif\ufb01ths, and C. Kemp. Theory-based bayesian models of inductive learning and\n\nreasoning. Trends in Cognitive Sciences, 10(7):309\u2013318, 2006.\n\n[27] Z. Tu and S.-C. Zhu. Image segmentation by data-driven markov chain monte-carlo. IEEE Trans. Patt.\n\nAnaly. Mach. Intell., 24(5):657\u2013673, 2002.\n\n[28] P. H. Winston. Learning structural descriptions from examples. In P. H. Winston, editor, The psychology\n\nof computer vision, pages 157\u2013209. McGraw-Hill, 1975.\n\n[29] L. Zhu, Y. Chen, and A. Yuille. Unsupervised learning of a probabilistic grammar for object detection\n\nand parsing. In NIPS, 2006.\n\n[30] S. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics\n\nand Vision, 4(2):259\u2013362, 2006.\n", "award": [], "sourceid": 890, "authors": [{"given_name": "Joseph", "family_name": "Schlecht", "institution": null}, {"given_name": "Kobus", "family_name": "Barnard", "institution": null}]}