{"title": "Analyzing 3D Objects in Cluttered Images", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 601, "abstract": "We present an approach to detecting and analyzing the 3D configuration of objects in real-world images with heavy occlusion and clutter. We focus on the application of finding and analyzing cars. We do so with a two-stage model; the first stage reasons about 2D shape and appearance variation due to within-class variation (station wagons look different than sedans) and changes in viewpoint. Rather than using a view-based model, we describe a compositional representation that models a large number of effective views and shapes using a small number of local view-based templates. We use this model to propose candidate detections and 2D estimates of shape. These estimates are then refined by our second stage, using an explicit 3D model of shape and viewpoint. We use a morphable model to capture 3D within-class variation, and use a weak-perspective camera model to capture viewpoint. We learn all model parameters from 2D annotations. We demonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D shape reconstruction on challenging images from the PASCAL VOC 2011 dataset.", "full_text": "Analyzing 3D Objects in Cluttered Images\n\nMohsen Hejrati\n\nUC Irvine\n\nDeva Ramanan\n\nUC Irvine\n\nshejrati@ics.uci.edu\n\ndramanan@ics.uci.edu\n\nAbstract\n\nWe present an approach to detecting and analyzing the 3D con\ufb01guration of objects\nin real-world images with heavy occlusion and clutter. We focus on the application\nof \ufb01nding and analyzing cars. We do so with a two-stage model; the \ufb01rst stage\nreasons about 2D shape and appearance variation due to within-class variation\n(station wagons look different than sedans) and changes in viewpoint. 
Rather\nthan using a view-based model, we describe a compositional representation that\nmodels a large number of effective views and shapes using a small number of\nlocal view-based templates. We use this model to propose candidate detections\nand 2D estimates of shape. These estimates are then re\ufb01ned by our second stage,\nusing an explicit 3D model of shape and viewpoint. We use a morphable model\nto capture 3D within-class variation, and use a weak-perspective camera model\nto capture viewpoint. We learn all model parameters from 2D annotations. We\ndemonstrate state-of-the-art accuracy for detection, viewpoint estimation, and 3D\nshape reconstruction on challenging images from the PASCAL VOC 2011 dataset.\n\n1 Introduction\n\nFigure 1: We describe two-stage models for detecting and analyzing the 3D shape of objects in\nunconstrained images. In the \ufb01rst stage, our models reason about 2D appearance and shape using\nvariants of deformable part models (DPMs). We use global mixtures of trees with local mixtures\nof gradient-based part templates (top-left). Global mixtures capture constraints on visibility and\nshape (headlights are only visible in certain views at certain locations), while local mixtures capture\nconstraints on appearance (headlights look different in different views). Our 2D models localize\neven fully-occluded landmarks, shown as hollow circles and dashed lines in (top-middle). We feed\nthis output to our second stage, which directly reasons about 3D shape and camera viewpoint. We\nshow the reconstructed 3D model and associated ground-plane (assuming it is parallel to the car body)\non (top-right). The bottom row shows 3D reconstructions from four novel viewpoints.\n\nA grand challenge in machine vision is the task of understanding 3D objects from 2D images. Classic\napproaches based on 3D geometric models [2] could sometimes exhibit brittle behavior on cluttered,\n\u201cin-the-wild\u201d images. 
Contemporary recognition methods tend to build statistical models of\n2D appearance, consisting of classi\ufb01ers trained with large training sets using engineered appearance\nfeatures. Successful examples include face detectors [30], pedestrian detectors [7], and general\nobject-category detectors [10]. Such methods seem to work well even in cluttered scenes, but are\nusually limited to coarse 2D output, such as bounding-boxes.\nOur work is an attempt to combine the two approaches, with a focus on statistical, 3D geometric\nmodels of objects. Speci\ufb01cally, we focus on the practical application of detecting and analyzing\ncars in cluttered, unconstrained images. We refer the reader to our results (Fig.4) for a sampling of\ncluttered images that we consider. We develop a model that detects cars, estimates camera viewpoint,\nand recovers 3D landmark con\ufb01gurations and their visibility with state-of-the-art accuracy. It does\nso by reasoning about appearance, 3D shape, and camera viewpoint through the use of 2D structured,\nrelational classi\ufb01ers and 3D geometric subspace models.\nWhile deformable models and pictorial structures [10, 31, 11] are known to successfully model\narticulation, 3D viewpoint is still not well understood. The typical solution is to \u201cdiscretize\u201d viewpoint\nand build multiple view-based models tuned for each view (frontal, side, 3/4...). One advantage of\nsuch a \u201cbrute-force\u201d approach is that it is computationally ef\ufb01cient, at least for a small number of\nviews. Fine-grained 3D shape estimation may still be dif\ufb01cult with such a strategy. On the other\nhand, it is dif\ufb01cult to build models that reason directly in 3D because the \u201cinverse-rendering\u201d problem\nis hard to solve. 
We introduce a two-stage approach that \ufb01rst reasons about 2D shape and appearance\nvariation, and then reasons explicitly about 3D shape and viewpoint given 2D correspondences\nfrom the \ufb01rst stage. We show that \u201cinverse-rendering\u201d is feasible by way of 2D correspondences.\n2D shape and appearance: Our \ufb01rst stage models 2D shape and appearance using a variant of\ndeformable part models (DPMs) designed to produce reliable 2D landmark correspondences. Our\napproach differs from traditional view-based models in that it is compositional; it \u201ccuts and pastes\u201d\ntogether different sets of local view-based templates to model a large set of global viewpoints. We\nuse global mixtures of trees with local mixtures of \u201cpart\u201d or landmark templates. Global mixtures\ncapture constraints on visibility and shape (headlights are only visible in certain views at certain\nlocations), while local mixtures capture constraints on appearance (headlights look different in\ndifferent views). We use this model to ef\ufb01ciently generate candidate 2D detections that are re\ufb01ned\nby our second 3D stage. One salient aspect of our 2D model is that it reports 2D locations of all\nlandmarks including occluded ones, each augmented with a visibility \ufb02ag.\n3D shape and viewpoint: Our second layer processes the 2D output of our \ufb01rst stage, incorporating\nglobal shape constraints arising from 3D shape variation and viewpoint. To capture viewpoint\nconstraints, we model landmarks as weak-perspective projections of a 3D object. To capture\nwithin-class variation, we model the 3D shape of any object instance as a linear combination of 3D basis\nshapes. We use tools from nonrigid structure-from-motion (SFM) to both learn and enforce such\nmodels using 2D correspondences. 
Crucially, we make use of occlusion reports generated by our\nlocal view-based templates to estimate morphable 3D shape and camera viewpoint.\n\n2 Related Work\n\nWe focus most on recognition methods that deal explicitly with 3D viewpoint variation.\nVoting-based methods: One approach to detection and viewpoint classi\ufb01cation is based on bottom-up\ngeometric voting, using a Hough transform or geometric hashing. Images are \ufb01rst processed to\nobtain a set of local feature detections. Each detection can then vote for both an object location\nand viewpoint. Examples include [12] and implicit shape models [1, 26]. Our approach differs in\nthat we require no initial feature detection stage, and instead we reason about all possible geometric\ncon\ufb01gurations and occlusion states.\nView-based models: Early successful approaches included multiview face detection [24, 17]. Recent\napproaches based on view-based deformable part models include [19, 13, 10]. Our model differs in\nthat we use a single representation that directly generates multiple views. One can augment\nview-based models to share local parts across views [27, 21, 32]. This typically requires reasoning about\ntopological changes in viewpoint; certain parts or features can only be visible in certain views due\nto self-occlusion. One classic representation for encoding such visibility constraints is an aspect\ngraph [5]. [33] model such topological constraints with global mixtures with varying tree structures.\nOur model is similar to such approaches, except that we use a decomposable notion of aspect; we\nsimultaneously reason about global and semi-local changes in visibility using local part mixtures\nwith global co-occurrence constraints.\n3D models: One can also directly reason about local features and their geometric arrangement in\na 3D coordinate system [23, 25, 34]. 
Though such models are three-dimensional in terms of their\nunderlying representation, run-time inference usually proceeds in a bottom-up manner, where detected\nfeatures vote for object locations. To handle non-Gaussian observation models, [18] evaluate\nrandomly sampled model estimates within a RANSAC search. Our approach is closely related to the\nrecent work of [22], which also uses a deformable part model (DPM) to capture viewpoint variation\nin cars. Though they learn spatial constraints in a 3D coordinate frame, their model at run-time is\nequivalent to a view-based model, where each view is modeled with a star-structured DPM. Our\nmodel differs in that we directly reason about the location of fully-occluded landmarks, we model\nan exponential number of viewpoints by using a compositional representation, and we produce continuous\n3D shapes and camera viewpoints associated with each detection using only 2D training\ndata. Finally, we represent the space of 3D models of an object category using a set of basis shapes,\nsimilar to the morphable models of [3]. To estimate such models from 2D data, we adapt methods\ndesigned for tracking morphable shapes to 3D object category recognition [29, 28].\n3 2D Shape and Appearance\nWe \ufb01rst describe our 2D model of shape and appearance. We write it as a scoring function with\nlinear parameters. Our model can be seen as an extension of the \ufb02exible mixtures-of-part model\n[31], which itself augments a deformable part model (DPM) [10] to reason about local mixtures.\nOur model differs in its encoding of occlusion states using local mixtures, as well as the introduction\nof global mixtures that enforce occlusions and spatial geometry consistent with changes in 3D\nviewpoint. We take care to design our model so as to allow for ef\ufb01cient dynamic-programming\nalgorithms for inference.\nLet I be an image, pi = (x, y) be the pixel location for part i and ti \u2208 {1..T} be the local mixture\ncomponent of part i. 
As an example, part i may correspond to a front-left headlight, and ti can\ncorrespond to different appearances of a headlight in frontal, side, or three-quarter views. A notable\naspect of our model is that we estimate landmark locations for all parts in all views, even when they\nare fully occluded. We will show that local mixture variables perform surprisingly well at modeling\ncomplex appearances arising from occlusions.\nLet i \u2208 V where V is the set of all landmarks. We consider different relational graphs Gm =\n(V, Em) where Em connects pairs of landmarks constrained to have consistent locations and local\nmixtures in global mixture m. We can loosely think of m as a \u201cglobal viewpoint\u201d, though it will be\nlatently estimated from the data. We use the lack of subscript to denote the set of variables obtained\nby iterating over that subscript; e.g., p = {pi : i \u2208 V }. Given an image, we score a collection of\nlandmark locations and mixture variables\n\nS(I, p, t, m) = \u2211_{i\u2208V} \u03b1_i^{t_i} \u00b7 \u03c6(I, p_i) + \u2211_{ij\u2208E_m} [\u03b2_{ijm}^{t_i,t_j} \u00b7 \u03c8(p_i \u2212 p_j) + \u03b3_{ijm}^{t_i,t_j}]    (1)\n\nLocal model: The \ufb01rst term scores the appearance evidence for placing a template \u03b1_i^{t_i} for part i,\ntuned for mixture ti, at location pi. We write \u03c6(I, pi) for the feature vector (e.g., HOG descriptor\n[7]) extracted from pixel location pi in image I. Note that we de\ufb01ne a template even for mixtures\nti corresponding to fully-occluded states. One may argue that no image evidence should be scored\nduring an occlusion; we take the view that the learning algorithm can decide for itself. It may\nchoose to learn a template of all zeros (essentially ignoring image evidence) or it may \ufb01nd gradient\nfeatures statistically correlated with occlusions (such as t-junctions). 
Unlike the remaining terms in\nour scoring function, the local appearance model is not dependent on the global mixture/viewpoint.\nWe show that this independence allows our model to compose together different local mixtures to\nmodel a single global viewpoint.\nRelational model: The second term scores relational constraints between pairs of parts. We write\n\u03c8(pi \u2212 pj) = [dx dx^2 dy dy^2], a vector of relative offsets between part i and part j. We\ncan interpret \u03b2_{ijm}^{t_i,t_j} as the parameters of a spring specifying the relative rest location and quadratic\nspring penalty for deviating from that rest location. Notably, this spring depends on part i and j,\nthe local mixture components of part i and j, and the global mixture m. This dependency captures\nmany natural constraints due to self-occlusion; for example, if a car\u2019s left-front wheel lies to the\nright of the other front wheel (in image space), then it is likely self-occluded. Hence it is crucial that\nlocal appearance and geometry depend on each other. The last term \u03b3_{ijm}^{t_i,t_j} de\ufb01nes a co-occurrence\nscore associated with instancing local mixtures ti and tj, and global mixture m. This encodes the\nconstraint that, if the left front headlight is occluded due to self occlusion, the left front wheel is also\nlikely occluded.\nGlobal model: We de\ufb01ne different graphs Gm = (V, Em) corresponding to different global mixtures.\nWe can loosely think of the global variable m as capturing a coarse, quantized viewpoint. To\nensure tractability, we force all edge structures to be tree-structured. Intuitively, different relational\nstructures may help because occluded landmarks tend to be localized with less reliability. One may\nexpect occluded/unreliable parts should have fewer connections (lower degrees in Gm) than reliable\nparts. 
Even for a \ufb01xed global mixture m, our model can generate an exponentially-large set of appearances\n(T^{|V|} combinations), where T is the number of local mixture types. We show such a model outperforms\na naive view-based model in our experiments.\n\n3.1 Inference\nInference corresponds to maximizing (1) with respect to landmark locations p, local mixtures t, and\nglobal mixtures m:\n\nS\u2217(I) = max_m [max_{p,t} S(I, p, t, m)]    (2)\n\nWe optimize the above equation by enumerating all global mixtures m, and for each global mixture,\n\ufb01nding the optimal combination of landmark locations p and local mixtures t by dynamic programming\n(DP). To see that the inner maximization can be optimized by DP, let us de\ufb01ne zi = (pi, ti) to\ndenote both the discrete pixel position and discrete mixture type of part i. We can rewrite the score\nfrom (1) for a \ufb01xed image I and global mixture m with edge structure E as:\n\nS(z) = \u2211_{i\u2208V} \u03c6_i(z_i) + \u2211_{ij\u2208E} \u03c8_{ij}(z_i, z_j)    (for a \ufb01xed I and m)    (3)\n\nwhere \u03c6_i(z_i) = \u03b1_i^{t_i} \u00b7 \u03c6(I, p_i) and \u03c8_{ij}(z_i, z_j) = \u03b2_{ijm}^{t_i,t_j} \u00b7 \u03c8(p_i \u2212 p_j) + \u03b3_{ijm}^{t_i,t_j}.\n\nFrom this perspective, it is clear that our model (conditioned on I and m) is a discrete, pairwise\nMarkov random \ufb01eld (MRF). When G = (V, E) is tree-structured, one can compute max_z S(z)\nwith dynamic programming [31].\n\n3.2 Learning\nWe assume we are given training data consisting of image-landmark triplets {In, pin, oin}, where\nlandmarks are augmented with an additional discrete visibility \ufb02ag oin. With a slight abuse of notation,\nwe use n to denote an instance of a training image. We use oin \u2208 {0, 1, 2} to denote visible,\nself-occlusion, and other-occlusion respectively, where other occlusion corresponds to a landmark\nthat is occluded by another object (or the image border). 
We now show how to augment this training\nset with local mixture labels tin, global mixture labels mn, and global edge structures Em.\nEssentially, we infer such mixture labels using probabilistic algorithms for generating local/global\nclusters of 2D landmark con\ufb01gurations. We then use these inferred mixture labels to train the linear\nparameters of the scoring function (1) using supervised, max-margin methods.\nLearning local mixtures: We use the clustering algorithm described in [8, 4] to learn local part mixtures.\nWe construct a \u201clocal-geometric-context\u201d vector for each part, and obtain landmark mixture\nlabels by grouping landmark instances with similar local geometry. Speci\ufb01cally, for each landmark\ni and image n, we construct a K-element vector gin that de\ufb01nes the 2D relative location of a landmark\nwith respect to the other K landmarks in instance n, normalized for the size of that training\ninstance. We construct sets of features Setij = {gin : n \u2208 1..N and oin = j} corresponding to\neach part i and occlusion state j. We separately cluster each set of vectors using K-means, and then\ninterpret cluster membership as mixture label tin. This means that, for landmark i, a third of its T\nlocal mixtures will model visible instances in the training set, a third will model self-occlusions, and\na third will capture other-occlusions.\nLearning relational structure: Given local mixture labels tin, we simultaneously learn global mixtures\nmn and edge structure Em with a probabilistic model of zin = (pin, tin). We \ufb01nd the global\nmixtures and edge structure that maximize the probability of the observed {zin} labels. Probabilistically\nspeaking, our spatial spring model is equivalent to a Gaussian model (whose mean and\ncovariance correspond to the rest location and rigidity), making estimation relatively straightforward. 
We \ufb01rst describe the special case of a single global mixture, for which the most-likely tree E\ncan be obtained by maximizing the mutual information of the labels using the Chow-Liu algorithm\n[6, 15]. In our case, we \ufb01nd the maximum-weight spanning tree in a fully connected graph whose\nedges are labeled with the mutual information (MI) between zi = (pi, ti) and zj = (pj, tj):\n\nMI(z_i, z_j) = MI(t_i, t_j) + \u2211_{t_i,t_j} P(t_i, t_j) MI(p_i, p_j | t_i, t_j)    (4)\n\nMI(ti, tj) can be directly computed from the empirical joint frequency of mixture labels in the\ntraining set. MI(pi, pj | ti, tj) is the mutual information of the Gaussian random variables for the\nlocation of landmarks i and j given a \ufb01xed pair of discrete mixture types ti, tj; this again is readily\nobtained by computing the determinant of the sample covariance of the locations of landmarks i and\nj, estimated from the training data. Hence both spatial consistency and mixture consistency are used\nwhen learning our relational structure.\nLearning structure and global mixtures: To simultaneously learn global mixture labels mn and\nedge structures associated with each mixture Em, we use an EM algorithm for learning mixtures of\ntrees [20, 15]. Speci\ufb01cally, Meila and Jordan [20] describe an EM algorithm that iterates between\ninferring distributions over tree mixture assignments (the E-step) and estimating the tree structure\n(the M-step). One can write the expected complete log-likelihood of the observed labels {z}, where\n\u03b8 are the model parameters (Gaussian spatial models, local mixture co-occurrences and global mixture\npriors) to be maximized and the global mixture assignment variables {mn} are the hidden\nvariables to be marginalized. Notably, the M-step makes use of the Chow-Liu algorithm. We omit\ndetailed equations for lack of space, but note that this is a relatively straightforward application of\n[20]. 
We demonstrate that our latently-estimated global mixtures are crucial for high performance\nin 3D reasoning.\nLearning parameters: The previous steps produce local/global mixture labels and edge structures.\nTreating these as \u201cground-truth\u201d, we now de\ufb01ne a supervised max-margin framework for learning\nmodel parameters. To do so, let us write the landmark position labels pn, local mixture labels\ntn, and global mixture label mn collectively as yn. Given a training set of positive images with\nlabels {In, yn} and negative images not containing the object of interest, we de\ufb01ne a structured\nprediction objective function similar to one proposed in [31]. The scoring function in (1) is linear\nin the parameters w = {\u03b1, \u03b2, \u03b3}, and therefore can be expressed as S(In, yn) = w \u00b7 \u03a6(In, yn). We\nlearn a model of the form:\n\nargmin_{w, \u03be_n\u22650}  (1/2) w^T w + C \u2211_n \u03be_n    (5)\n\ns.t. \u2200n \u2208 positive images: w \u00b7 \u03a6(I_n, y_n) \u2265 1 \u2212 \u03be_n\n\u2200n \u2208 negative images, \u2200y: w \u00b7 \u03a6(I_n, y) \u2264 \u22121 + \u03be_n\n\nThe above constraints state that positive examples should score higher than 1 (the margin), while\nnegative examples, for all con\ufb01gurations of part positions and mixtures, should score less than -1.\nWe collect negative examples from images that do not contain any cars. This form of learning\nproblem is known as a structural SVM, and there exist many well-tuned solvers such as the cutting\nplane solver of SVMStruct in [16] and the stochastic gradient descent solver in [10]. We use the\ndual coordinate-descent QP solver of [31]. We show an example of a learned model and its learned\ntree structure in Fig.1.\n\n4 3D Shape and Viewpoint\nThe previous section describes our 2D model of appearance and shape. We use it to propose detections\nwith associated landmark positions p\u2217. 
In this section, we describe a 3D shape and viewpoint\nmodel for re\ufb01ning p\u2217. Consider 2D views of a single rigid object; 2D landmark positions must obey\nepipolar geometry constraints. In our case, we must account for within-class shape variation as well\n(e.g., sedans look different than station wagons). To do so, we make two simplifying assumptions:\n(1) We assume the depth variation of our objects is small compared to the distance from the camera,\nwhich corresponds to a weak-perspective camera model. (2) We assume the 3D landmarks of all\nobject instances can be written as linear combinations of a few basis shapes. Let us write the set of\ndetected landmark positions p\u2217 as a 2 \u00d7 K matrix, where K = |V|. We now describe a procedure\nfor re\ufb01ning p\u2217 to be consistent with these two assumptions:\n\nmin_{R,\u03b1} ||p\u2217 \u2212 R \u2211_i \u03b1_i B_i||^2    where p\u2217 \u2208 R^{2\u00d7K}, R \u2208 R^{2\u00d73}, R R^T = Id, B_i \u2208 R^{3\u00d7K}    (6)\n\nHere, R is an orthonormal camera projection matrix, Bi is the ith basis shape, and Id is the\nidentity matrix. We factor out camera translations by working with mean-centered points p\u2217 and let\n\u03b1 directly model weak-perspective scalings.\nInference: Given 2D landmark locations p\u2217 and a known set of 3D basis shapes Bi, inference\ncorresponds to minimizing (6). For a single basis shape (nB = 1), this problem is equivalent to\nthe well-known \u201cextrinsic orientation\u201d problem of registering a 3D point cloud to a 2D point cloud\nwith known correspondence [14]. Because the residual in (6) is linear in \u03b1i for a \ufb01xed R and linear in R\nfor \ufb01xed \u03b1i, we solve for the\ncoef\ufb01cients and rotation with an iterative least-squares algorithm. We enforce the orthonormality\nof R with a nonlinear optimization, initialized by the least-squares solution [14]. 
This means that\nwe can associate each detection with shape basis coef\ufb01cients \u03b1i (which allows us to reconstruct\nthe 3D shape) and camera viewpoint R. One could combine the reprojection error of (6) with our\noriginal scoring function from (1) into a single objective that jointly searches over all 2D and 3D\nunknowns. However, inference would be exponential in K. We \ufb01nd a two-layer inference algorithm\nto be computationally ef\ufb01cient but still effective.\nLearning: The above inference algorithm requires the morphable 3D basis Bi at test-time. One\ncan estimate such a basis given training data with labeled 2D landmark positions by casting this as\na nonrigid structure from motion (SFM) problem. Stack all 2D landmarks from N training images\ninto a 2N \u00d7 K matrix. In the noise-free case, this matrix is rank 3nB (where nB is the number of\nbasis shapes), since each row can be written as a linear combination of the 3D coordinates of nB\nbasis shapes. This means that one can use rank constraints to learn a 3D morphable basis. We use the\npublicly available nonrigid SFM code of [28]. By slightly modifying it to estimate \u201cmotion\u201d given a\nknown \u201cstructure\u201d, we can also use it to perform the previous projection step during inference.\nOcclusion: A well-known limitation of SFM methods is their restricted success under heavy occlusion.\nNotably, our 2D appearance model provides location estimates for occluded landmarks. Many\nSFM methods (including [28]) can deal with limited occlusion through the use of low-rank constraints;\nessentially, one can still estimate low-rank approximations of matrices with some missing\nentries. We can use this property to learn models from partially-labeled training sets. Recall that\nour learning formulation requires all landmarks (including occluded ones) to be labeled in training\ndata. Manually labeling the positions of occluded landmarks can be ambiguous. 
Instead, we use the\nestimated shape basis and camera viewpoints to infer/correct the locations of occluded landmarks.\n5 Experiments\nDatasets: To evaluate our model, we focus on car detection and 3D landmark estimation in cluttered,\nreal-world datasets with severe occlusions. We labeled a subset of 500 images from the PASCAL\nVOC 2011 dataset [9] with locations and visibility states of 20 car landmarks. Our dataset contains\n723 car instances. 36% of landmarks are not visible due to self-occlusion, while 21% of landmarks\nare not visible due to occlusion by another object (or truncation due to the image border). Hence\nover half our landmarks are occluded, making our dataset considerably more dif\ufb01cult than those\ntypically used for landmark localization or 3D viewpoint estimation. We evenly split the images\ninto a train/test set. We also compare results on a more standard viewpoint dataset from [1], which\nconsists of 200 relatively \u201cclean\u201d cars from the PASCAL VOC 2007 dataset, marked with 40 discrete\nviewpoint class labels.\nImplementation: We modify the publicly available code of [31] and [28] to learn our models,\nsetting the number of local mixtures T = 9, the number of global mixtures M = 50, and the\nnumber of basis shapes nB = 5. We found results relatively robust to these settings. Learning our\n2D deformable model takes roughly 4 hours, while learning our 3D shape model takes less than\na minute. Our model is de\ufb01ned at a canonical scale, so we search over an image pyramid to \ufb01nd\ndetections at multiple scales. Total run-time for a test image (including both 2D and 3D processing\nover all scales) is 10 seconds.\nEvaluation: Given an image, our algorithm produces multiple detections, each with 3D landmark\nlocations, visibility \ufb02ags, and camera viewpoints. 
We qualitatively visualize such output in Fig.4.\nTo evaluate our output, we assume test images are marked with ground-truth cars, each annotated\nwith ground-truth 2D landmarks and visibility \ufb02ags. We measure the performance of our algorithm\non four tasks. We evaluate object detection (AP) using the PASCAL criteria of Average\nPrecision [9], de\ufb01ning a detection to be correct if its bounding box overlaps the ground truth by\n50% or more. We evaluate 2D landmark localization (LP) by counting the fraction of predicted\nlandmarks that lie within 0.5x pixels of the ground-truth, where x is the diameter of the associated\nground-truth wheel.\n\nFigure 2: We report histograms of viewpoint label errors for the dataset of [1]. We compare to the\nreported performance of [1] and [12]. Our model reduces the median error (right) by a factor of 2.\n\n(a) Baseline comparison\n\n(b) Diagnostic analysis\n\nFigure 3: We compare our model with various view-based baselines in (a), and examine various\ncomponents of our model through a diagnostic analysis in (b). We refer the reader to the text for\na detailed analysis, but our model outperforms many state-of-the-art view-based baselines based on\ntrees, stars, and latent parts. We also \ufb01nd that modeling the effects of shape due to global changes\nin 3D viewpoint is crucial for both detection and landmark localization.\n\nWe evaluate landmark visibility prediction (VP) by counting the number\nof landmarks whose predicted visibility state matches the ground-truth, where landmarks may be\n\u201cvisible\u201d, \u201cself-occluded\u201d, or \u201cother-occluded\u201d. Our 3D shape model re\ufb01nes only LP and VP, so\nAP is determined solely by our 2D (mixtures of trees) model. To avoid con\ufb02ating the evaluation\nmeasures, we evaluate LP and VP assuming bounding-box correspondences between candidates and\nground-truth instances are provided. 
Finally, to evaluate viewpoint classi\ufb01cation (VC), we compare\npredicted camera viewpoints with ground-truth viewpoints on the standard benchmark of [1].\nViewpoint Classi\ufb01cation: We \ufb01rst present results for viewpoint classi\ufb01cation in Fig.2 on the benchmark\nof [1]. Given a test instance, we run our detector, estimate the camera rotation R, and report the\nreconstructed 2D landmarks generated using the estimated R. Then we produce a quantized viewpoint\nlabel by matching the reconstructions to landmark locations for a reference image (provided\nin the dataset). We found this approach more reliable than directly matching 3D rotation matrices\n(for which metric distances are hard to de\ufb01ne). We produce a median error of 9 degrees, a factor of\n2 improvement over state-of-the-art. This suggests our model does accurately capture viewpoints.\nWe next turn to a detailed analysis on our new cluttered dataset.\nBaselines: We compare the performance of our overall system to several existing approaches for\nmultiview detection in Fig.3(a). We \ufb01rst compare to the widely-used latent deformable part model\n(DPM) of [10], trained on the exact same data as our model. A supervised DPM (MV-star) considerably\nimproves performance from 63 to 74% AP, where supervision is provided for (view-speci\ufb01c)\nroot mixtures and part locations. This latter model is equivalent in structure to a state-of-the-art\nmodel for car detection and viewpoint estimation [22], which trains a DPM using supervision provided\nby a 3D CAD model. By allowing for tree-structured relations in each view-speci\ufb01c global\nmixture (MV-tree), we see a small drop in AP, to 72.3%. Our \ufb01nal model is similar in terms of\ndetection performance (AP = 72.5%), but does noticeably better than both view-based models for\nlandmark prediction. We correctly localize landmarks 69.5% of the time, while MV-tree and MV-star\nscore 65.7% and 64.7%, respectively. 
We produce landmark visibility (VP) estimates from our multiview\nbaselines by predicting a \ufb01xed set of visibility labels conditioned on the view-based mixture.\nWe should note that accurate landmark localization is crucial for estimating the 3D shape of the\ndetected instance. We attribute our improvement to the fact that our model can represent a large number\nof global viewpoints by composing together different local view-based templates.\n\n[Figure 2 plots: histogram of viewpoint error in degrees, and median degree error, for our model, Arie-Nachimson and Basri [1], and Glasner et al. [12].]\n\n[Figure 3(a) plots: LP, VP, and AP for MV Tree, MV Star, and Us; precision-recall curves with DPM = 63.6%, MV star = 74.0%, MV Tree = 72.3%, Us = 72.5%.]\n\n[Figure 3(b) plots: LP, VP, and AP for Global+3D, Global, and Local; precision-recall curves with Local = 69%, Global = 72.5%.]\n\nFigure 4: Sample results of our system on real images with heavy clutter and occlusion. We show\npairs of images corresponding to detections that matched to ground-truth annotations. The top image\n(in the pair) shows the output of our tree model, and the bottom shows our 3D shape reconstruction,\nfollowing the notational conventions of Fig.1. Our system estimates 3D shapes of multiple cars\nunder heavy clutter and occlusions, even in cases where more than 50% of a car is occluded. Our\nmorphable 3D model adapts to the shape of the car, producing different reconstructions for SUVs\nand sedans (row 2, columns 2-3). Recall that our tree model explicitly reasons about changes in\nvisibility due to self-occlusions versus occlusions from other objects, manifested as local mixture\ntemplates. This allows our 3D reconstructions to model occlusions due to other objects (e.g., the rear\nof the car in row 2, column 3). 
In some cases, the estimated 3D shape is misaligned due to extreme\nshape variation of the car instance (e.g., the folding doors on the lower-right).\n\nDiagnostics: We compare various aspects of our model in Fig.3(b). \u201cLocal\u201d refers to a single tree\nmodel with local mixtures only, while \u201cGlobal\u201d refers to our global mixtures of trees. We see a\nsmall improvement in terms of AP, from 69% for \u201cLocal\u201d to 72.5% for \u201cGlobal\u201d. However, in terms\nof landmark prediction, \u201cGlobal\u201d strongly outperforms \u201cLocal\u201d, 69.4% to 57.2%. We use these\npredicted landmarks to estimate 3D shape below.\n3D Shape: Our 3D shape model reports back a z depth value for each landmark (x, y) position.\nUnfortunately, depth is hard to evaluate without ground-truth 3D annotations. Instead, we evaluate\nthe improvement in re-projected VP and LP due to our 3D shape model; we see a small improvement\nof roughly 2 points in LP accuracy, from 69.4% to 71.2%. We further analyze this by looking at the improvement\nin localization accuracy of ground-truth landmarks that are visible (73.3 to 74.8%), self-occluded\n(70.5 to 72.5%), and other-occluded (22.5 to 23.4%). We see the largest improvement for occluded\nparts, which makes intuitive sense. Local templates corresponding to occluded mixtures will be less\naccurate, and so will bene\ufb01t more from a 3D shape model.\nConclusion: We have described a geometric model for detecting and estimating the 3D shape of\nobjects in heavily cluttered, occluded, real-world images. Our model differs from typical multiview\napproaches by reasoning about local changes in landmark appearance and global changes in\nvisibility and shape, through the aid of a morphable 3D model. While our model is similar to prior\nwork in terms of detection performance, it produces signi\ufb01cantly better estimates of 2D/3D\nlandmarks and camera positions, and quanti\ufb01ably improves localization of occluded landmarks. 
Though\nwe have focused on the application of analyzing cars, we believe our method could apply to other\ngeometrically-constrained objects.\n\nReferences\n[1] M. Arie-Nachimson and R. Basri. Constructing implicit 3d shape models for pose estimation. In ICCV, 2009.\n[2] T. Binford. Survey of model-based image analysis systems. The International Journal of Robotics Research, 1(1):18\u201364, 1982.\n[3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187\u2013194. ACM Press/Addison-Wesley Publishing Co., 1999.\n[4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1365\u20131372. IEEE, 2009.\n[5] K. Bowyer and C. Dyer. Aspect graphs: An introduction and survey of recent results. International Journal of Imaging Systems and Technology, 2(4):315\u2013328, 1990.\n[6] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions on, 14(3):462\u2013467, 1968.\n[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.\n[8] C. Desai and D. Ramanan. Detecting actions, poses, and objects with relational phraselets. ECCV, 2012.\n[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.\n[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE PAMI, 99(1), 5555.\n[11] R. Girshick, P. Felzenszwalb, and D. McAllester. Object detection with grammar models. In NIPS, 2011.\n[12] D. Glasner, M. Galun, S. Alpert, R. Basri, and G. 
Shakhnarovich. Viewpoint-aware object detection and pose estimation. In ICCV, pages 1275\u20131282. IEEE, 2011.\n[13] C. Gu and X. Ren. Discriminative mixture-of-templates for viewpoint classi\ufb01cation. ECCV, pages 408\u2013421, 2010.\n[14] B. Horn. Robot vision. The MIT Press, 1986.\n[15] S. Ioffe and D. Forsyth. Mixtures of trees for object recognition. In CVPR, 2001.\n[16] T. Joachims, T. Finley, and C. Yu. Cutting plane training of structural SVMs. Machine Learning, 2009.\n[17] M. Jones and P. Viola. Fast multi-view face detection. In CVPR, 2003.\n[18] Y. Li, L. Gu, and T. Kanade. A robust shape model for multi-view car alignment. In CVPR, 2009.\n[19] R. Lopez-Sastre, T. Tuytelaars, and S. Savarese. Deformable part models revisited: A performance evaluation for object category pose estimation. In Computer Vision Workshops (ICCV Workshops), 2011.\n[20] M. Meila and M. Jordan. Learning with mixtures of trees. JMLR, 1:1\u201348, 2001.\n[21] P. Ott and M. Everingham. Shared parts for deformable part-based models. In CVPR, 2011.\n[22] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Teaching geometry to deformable part models. In CVPR, 2012.\n[23] S. Savarese and L. Fei-Fei. 3d generic object categorization, localization and pose estimation. In ICCV, pages 1\u20138. IEEE, 2007.\n[24] H. Schneiderman and T. Kanade. A statistical method for 3d object detection applied to faces and cars. In CVPR, volume 1, pages 746\u2013751. IEEE, 2000.\n[25] M. Sun, H. Su, S. Savarese, and L. Fei-Fei. A multi-view probabilistic model for 3d object classes. In CVPR, pages 1247\u20131254. IEEE, 2009.\n[26] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, and L. Van Gool. Towards multi-view object class detection. In CVPR, volume 2, pages 1589\u20131596. IEEE, 2006.\n[27] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. PAMI, 29(5):854\u2013869, 2007.\n[28] L. 
Torresani, A. Hertzmann, and C. Bregler. Learning non-rigid 3d shape from 2d motion. Advances in Neural Information Processing Systems, 16, 2003.\n[29] L. Torresani, D. Yang, E. Alexander, and C. Bregler. Tracking and modeling non-rigid objects with rank constraints. In CVPR, volume 1, pages I\u2013493. IEEE, 2001.\n[30] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, volume 1, pages I\u2013511. IEEE, 2001.\n[31] Y. Yang and D. Ramanan. Articulated pose estimation with \ufb02exible mixtures-of-parts. In CVPR, 2011.\n[32] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille. Part and appearance sharing: Recursive compositional models for multi-view multi-object detection. Pattern Recognition, 2010.\n[33] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.\n[34] M. Zia, M. Stark, B. Schiele, and K. Schindler. Revisiting 3d geometric models for accurate object shape and pose. In ICCV Workshops, pages 569\u2013576. IEEE, 2011.\n", "award": [], "sourceid": 287, "authors": [{"given_name": "Mohsen", "family_name": "Hejrati", "institution": null}, {"given_name": "Deva", "family_name": "Ramanan", "institution": null}]}