{"title": "Image Parsing with Stochastic Scene Grammar", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 81, "abstract": "This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules rep- resent an ensemble of visual entities. Contextual relations: (i) Cooperative \u201c+\u201d relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive \u201c-\u201d relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. 
In addition, our approach achieves richer structures in the parse tree.", "full_text": "Image Parsing via Stochastic Scene Grammar\n\nYibiao Zhao\u2217\nDepartment of Statistics\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\nybzhao@ucla.edu\n\nSong-Chun Zhu\nDepartment of Statistics and Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\nsczhu@stat.ucla.edu\n\nAbstract\n\nThis paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting the background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, and 2D faces down to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative \u201c+\u201d relations represent positive links between binding entities, such as hinged faces of an object or aligned boxes; (ii) Competitive \u201c-\u201d relations represent negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely hierarchical cluster sampling, to search the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: it forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. 
(ii) Sampling: it jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiments, we demonstrate the superiority of our algorithm over existing methods on a public dataset. In addition, our approach achieves richer structures in the parse tree.\n\n1 Introduction\n\nScene understanding is an important task in neural information processing systems. By analogy to natural language parsing, we pose the scene understanding problem as parsing an image into a hierarchical structure of visual entities (Fig.1(i)) using the Stochastic Scene Grammar (SSG). The literature of scene parsing falls into two categories: discriminative approaches and generative approaches.\n\n\u2217http://www.stat.ucla.edu/\u02dcybzhao/research/sceneparsing\n\nFigure 1: A parse tree of the geometric parsing result.\n\nFigure 2: 3D synthesis of novel views based on the parse tree.\n\nDiscriminative approaches focus on classifying each pixel (or superpixel) into a semantic label (building, sheep, road, boat etc.) with discriminative Conditional Random Field (CRF) models [5]-[7]. Without an understanding of the scene structure, pixel-level labeling is insufficient to represent knowledge of object occlusions, 3D relationships, functional space etc. To address this problem, geometric descriptions were added to the scene interpretation. Hoiem et al. [1] and Saxena et al. [8] generated surface orientation labels and depth labels by exploring rich geometric features and context information. Gupta et al. [9] modeled 3D objects as blocks and inferred their 3D properties, such as occlusion, exclusion and stableness, in addition to surface orientation labels. They showed that a global 3D prior does help 2D surface labeling. For indoor scenes, Hedau et al. [2], Wang et al. [3] and Lee et al. 
[4] adopted different approaches to model the geometric layout of the background and/or foreground objects, and fit their models into Structured SVM (or Latent SVM) settings [10]. The Structured SVM uses features extracted jointly from input-output pairs and maximizes the margin over the structured output space. These algorithms involve hidden variables or structured labels in discriminative training. However, these discriminative approaches lack a general representation of visual vocabulary and a principled approach for exploring the compositional structure.\nGenerative approaches make efforts to model reconfigurable graph structures with generative probabilistic models. Stochastic grammars were used to parse natural languages [11]. Compositional models for hierarchical structure and shared parts were studied in visual object recognition [12]-[15]. Zhu and Mumford [16] proposed an AND/OR Graph Model to represent compositional structures in vision. However, the expressive power of configurable graph structures comes at the cost of high computational complexity of searching a large configuration space. In order to accelerate the inference, Adaptor Grammars [17] applied the idea of an \u201cadaptor\u201d (re-using subtrees) that induces dependencies among successive uses. Han and Zhu [18] applied grammar rules, in a greedy manner, to detect rectangular structures in man-made scenes. Porway et al. [19][20] allowed the Markov chain to jump between competing solutions with a C4 algorithm.\n\n[Figure 1 panels: (i) a parse tree (scene, 3D foregrounds, 3D background, 2D faces, 1D line segments); (ii) input image and line detection; (iii) geometric parsing result; (iv) reconstruction via line segments]\n\nOverview of the approach. In this paper, we parse an image into a hierarchical structure, namely a parse tree, as shown in Fig.1. 
The parse tree covers a wide spectrum of visual entities, including scene categories, 3D foreground/background, 2D faces, and 1D line segments. With the low-level information of the parse tree, we reconstruct the original image from the appearance of the line segments, as shown in Fig.1(iv). With the high-level information of the parse tree, we further recover the 3D scene from the geometry of the 3D background and foreground objects, as shown in Fig.2.\nThis paper makes two major contributions to the scene parsing problem:\n(I) A Stochastic Scene Grammar (SSG) is introduced to represent the hierarchical structure of visual entities. The grammar starts with a single root node (the scene) and ends with a set of terminal nodes (line segments). In between, we generate all intermediate 3D/2D sub-structures by three types of production rules and two types of contextual relations, as illustrated in Fig.3. Production rules: AND, OR, and SET. (i) The AND rule encodes how sub-parts are composed into a larger structure. For example, three hinged faces form a 3D box, four linked line segments form a rectangle, and a background and the objects inside it form a scene in Fig.3(i); (ii) The SET rule represents an ensemble of entities, e.g. a set of 3D boxes or a set of 2D regions as in Fig.3(ii); (iii) The OR rule represents a switch between different sub-types, e.g. a 3D foreground and a 3D background have several switchable types in Fig.3(iii). Contextual relations: Cooperative \u201c+\u201d and Competitive \u201c-\u201d. (i) If visual entities satisfy a cooperative \u201c+\u201d relation, they tend to bind together, e.g. the hinged faces of a foreground box shown in Fig.3(a). (ii) If entities satisfy a competitive \u201c-\u201d relation, they compete with each other for presence, e.g. two exclusive foreground boxes competing for the same space in Fig.3(b).\n(II) A hierarchical cluster sampling algorithm is proposed to perform inference efficiently in the SSG model. 
The algorithm accelerates the Markov chain search by exploring contextual relations. It has two stages: (i) Clustering. Based on the detected line segments in Fig.1(ii), we form all possible larger structures (clusters). In each layer, the entities are first filtered by the cooperative \u201c+\u201d constraints, and they form a cluster only if they satisfy those constraints, e.g. several faces form a cluster of a box when their edges are hinged tightly. (ii) Sampling. The sampling process makes big reversible jumps by switching among competing sub-structures (e.g. two exclusive boxes).\nIn summary, the Stochastic Scene Grammar is a general framework to parse a scene with a large number of geometric configurations. We demonstrate the superiority of our algorithm over existing methods in the experiments.\n\n2 Stochastic Scene Grammar\n\nThe Stochastic Scene Grammar (SSG) is defined as a four-tuple G = (S, V, R, P), where S is a start symbol at the root (the scene); V = V^N \u222a V^T, where V^N is a finite set of non-terminal nodes (structures or sub-structures) and V^T is a finite set of terminal nodes (line segments); R = {r : \u03b1 \u2192 \u03b2} is a set of production rules, each of which represents a generating process from a parent node \u03b1 to its child nodes \u03b2 = Ch_\u03b1; and P(r) = P(\u03b2|\u03b1) is an expansion probability for each production rule (r : \u03b1 \u2192 \u03b2). The set of all valid configurations C derivable from the production rules is called the language: L(G) = {C : S \u2212{ri}\u2192 C, {ri} \u2282 R, C \u2282 V^T, P({ri}) > 0}.\nProduction rules. We define three types of stochastic production rules R^AND, R^OR, R^SET to represent the structural regularity and flexibility of visual entities. The regularity is enforced by the AND rule and the flexibility is expressed by the OR rule. 
The SET rule is a mixture of the OR and AND rules.\n(i) An AND rule (r^AND : A \u2192 a \u00b7 b \u00b7 c) represents the decomposition of a parent node A into three sub-parts a, b, and c. The probability P(a, b, c|A) measures the compatibility (contextual relations) among the sub-structures a, b, c. As seen in Fig.3(i), the grammar outputs a high probability if the three faces of a 3D box are well hinged, and a low probability if the foreground box lies outside the background.\n(ii) An OR rule (r^OR : A \u2192 a | b) represents the switching between two sub-types a and b of a parent node A. The probability P(a|A) indicates the preference for one sub-type over the others. For the 3D foreground in Fig.3(iii), the three sub-types in the third row represent objects below the horizon; these objects appear with high probabilities. Similarly, for the 3D background in Fig.3(iii), the camera rarely faces the ceiling or the ground; hence, the three sub-types in the middle row have higher probabilities (the higher, the darker). Moreover, OR rules also model the discrete size of entities, which is useful to rule out extremely large or small entities.\n(iii) A SET rule (r^SET : A \u2192 {a}^k, k \u2265 0) represents an ensemble of k visual entities. The SET rule is equivalent to a mixture of OR and AND rules (r^SET : A \u2192 \u2205 | a | a \u00b7 a | a \u00b7 a \u00b7 a | \u00b7\u00b7\u00b7). It first chooses a set size k by ORing, and then forms an ensemble of k entities by ANDing. It is worth noting that the OR rule essentially changes the graph topology of the output parse tree by changing the node size k. In this way, as seen in Fig.3(ii), the SET rule generates a set of 3D/2D entities which satisfy some contextual relations.\n\nFigure 3: Three types of production rules: AND (i), SET (ii), OR (iii), and two types of contextual relations: cooperative \u201c+\u201d relations (a), competitive \u201c-\u201d relations (b).\n\nContextual relations. 
There are two kinds of contextual relations, cooperative \u201c+\u201d relations and competitive \u201c-\u201d relations, which are involved in the AND and SET rules.\n(i) The cooperative \u201c+\u201d relations specify the concurrent patterns in a scene, e.g. hinged faces, nested rectangles, and aligned windows in Fig.3(a). Visual entities satisfying a cooperative \u201c+\u201d relation tend to bind together.\n(ii) The competitive \u201c-\u201d relations specify the exclusive patterns in a scene. If entities satisfy competitive \u201c-\u201d relations, they compete with each other for presence. As shown in Fig.3(b), if a 3D box is not contained by its background, or two 2D/3D objects are mutually exclusive, these cases will rarely appear in a solution simultaneously.\nTight structures vs. loose structures: If several visual entities satisfy a cooperative \u201c+\u201d relation, they tend to bind together, and we call them tight structures. These tight structures are grouped into clusters in the early stage of inference (Sect.4). If the entities neither satisfy any cooperative \u201c+\u201d relation nor violate a competitive \u201c-\u201d relation, they may be loosely combined; we call them loose structures, whose combinations are sampled in a later stage of inference (Sect.4). With the three production rules and two contextual relations, SSG is able to handle an enormous number of configurations and large geometric variations, which are the major difficulties in our task.\n\n3 Bayesian formulation of the grammar\n\nWe define a posterior distribution for a solution (a parse tree) pt conditioned on an input image I. This distribution is specified in terms of statistics defined over the derivation of production rules:\n\nP(pt|I) \u221d P(pt)P(I|pt) = P(S) \u220f_{v\u2208V^N} P(Ch_v|v) \u220f_{v\u2208V^T} P(I|v),   (1)\n\nwhere I is the input image and pt is the parse tree. 
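To make the factorization in Eq.1 concrete, the following is a minimal sketch in Python; the tuple layout of a parse tree and the `expand_logp`/`image_logp` callbacks are illustrative assumptions, not the paper's implementation:

```python
# Sketch of Eq.1 in log space: the prior contributes log P(Ch_v|v) for every
# non-terminal node, and the likelihood contributes log P(I|v) for every
# terminal node. A node is ('nt', rule_logp, [children]) or ('t', logp).
import math

def log_posterior(node, expand_logp, image_logp):
    """Recursively sum log P(Ch_v|v) over non-terminals and log P(I|v) over terminals."""
    if node[0] == 't':                      # terminal node: image likelihood P(I|v)
        return image_logp(node)
    _, _, children = node
    total = expand_logp(node)               # non-terminal: expansion probability P(Ch_v|v)
    for child in children:
        total += log_posterior(child, expand_logp, image_logp)
    return total

# A toy parse tree: one scene node expanding into two line-segment terminals.
tree = ('nt', math.log(0.5), [('t', math.log(0.9)), ('t', math.log(0.8))])
lp = log_posterior(tree,
                   expand_logp=lambda n: n[1],
                   image_logp=lambda n: n[1])
# lp = log 0.5 + log 0.9 + log 0.8
```

Working in log space avoids underflow when the product in Eq.1 runs over many nodes.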
The probability derivation represents a generating process of the production rules {r : v \u2192 Ch_v} from the start symbol S to the non-terminal nodes v \u2208 V^N, and on to the children Ch_v of the non-terminal nodes. The generating process stops at the terminal nodes v \u2208 V^T and generates the image I.\nWe use a probabilistic graphical model of an AND/OR graph [12, 17] to formulate our grammar. The graph structure G = (V, E) consists of a set of nodes V and a set of edges E. The edges define a parent-child conditional dependency for each production rule. The posterior distribution of a parse graph pt is given by a family of Gibbs distributions: P(pt|I; \u03bb) = 1/Z(I; \u03bb) exp{\u2212E(pt|I)}, where Z(I; \u03bb) = \u2211_{pt\u2208\u03a9} exp{\u2212E(pt|I)} is a partition function, a sum over the solution space \u03a9. The energy is decomposed into three potential terms:\n\nE(pt|I) = \u2211_{v\u2208V^OR} E^OR(A^T(Ch_v)) + \u2211_{v\u2208V^AND} E^AND(A^G(Ch_v)) + \u2211_{\u039b_v\u2208\u039b_I, v\u2208V^T} E^T(I(\u039b_v)).   (2)\n\n[Figure 3 panels: 3D foreground types; 3D background types; hinged faces, linked lines, aligned faces, aligned boxes, nested faces, stacked boxes, exclusive faces, invalid scene layout, exclusive boxes]\n\nFigure 4: Learning to synthesize. (a)-(d) Some typical samples drawn from the Stochastic Scene Grammar model with/without contextual relations.\n\n(i) The energy for OR nodes is defined over the \u201ctype\u201d attributes A^T(Ch_v) of the ORing child nodes. The potential captures the prior statistics of each switching branch: E^OR(A^T(v)) = \u2212log P(v \u2192 A^T(v)) = \u2212log{ #(v \u2192 A^T(v)) / \u2211_{u\u2208Ch(v)} #(v \u2192 u) }. 
The switching probabilities of the foreground objects and the background layouts are shown in Fig.3(iii).\n(ii) The energy for AND nodes is defined over the \u201cgeometry\u201d attributes A^G(Ch_v) of the ANDing child nodes. They are Markov Random Fields (MRFs) inside a tree structure. We define both \u201c+\u201d relations and \u201c-\u201d relations as E^AND = \u03bb_+ h^+(A^G(Ch_v)) + \u03bb_\u2212 h^\u2212(A^G(Ch_v)), where h(\u2217) are sufficient statistics in the exponential model and \u03bb are their parameters. Taking 2D faces as an example, the \u201c+\u201d relation specifies a quadratic distance between their connected joints, h^+(A^G(Ch_v)) = \u2211_{a,b\u2208Ch_v} (X(a) \u2212 X(b))^2, and the \u201c-\u201d relation specifies an overlap rate between their occupied image areas, h^\u2212(A^G(Ch_v)) = (\u039b_a \u2229 \u039b_b)/(\u039b_a \u222a \u039b_b), a, b \u2208 Ch_v.\n(iii) The energy for terminal nodes is defined over bottom-up image features I(\u039b_v) on the image area \u039b_v. The features used in this paper include: (a) surface labels from geometric context [1], (b) a 3D orientation map [21], and (c) the MDL coding length of line segments [20]. This term only captures the features from the dominant image area \u039b_v of each node, and avoids double counting of shared edges and occluded areas.\nWe learn the context-sensitive grammar model of SSG from a context-free grammar. Under the minimax entropy learning framework [25], we enforce the contextual relations by adding statistical constraints sequentially. The learning process matches the statistics between the current distribution p and a targeted distribution f by adding the most violated constraint in each iteration. Fig.4 shows typical samples drawn from the learned SSG model. 
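The two AND-node potentials above can be illustrated with a toy sketch; the joint/box representations and the unit \u03bb weights are hypothetical placeholders, not the paper's learned parameters:

```python
# Toy sketch of the AND-node energy E^AND = lam_plus * h_plus + lam_minus * h_minus.
# h_plus penalizes the squared distance between joints a "+" relation says should
# coincide; h_minus is the overlap rate |A∩B|/|A∪B| penalized by a "-" relation.

def h_plus(joint_a, joint_b):
    """Squared distance between two joints that should be hinged together."""
    return (joint_a[0] - joint_b[0]) ** 2 + (joint_a[1] - joint_b[1]) ** 2

def overlap_rate(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def e_and(joints, boxes, lam_plus=1.0, lam_minus=1.0):
    """Sum the "+" potential over hinged joint pairs and the "-" potential over competing boxes."""
    e = lam_plus * sum(h_plus(a, b) for a, b in joints)
    e += lam_minus * sum(overlap_rate(a, b) for a, b in boxes)
    return e

# Two faces hinged at the same point add no "+" energy; two disjoint boxes add no "-" energy.
e = e_and(joints=[((3, 4), (3, 4))], boxes=[((0, 0, 1, 1), (2, 2, 3, 3))])
```

Low energy thus corresponds to tightly hinged parts and non-overlapping competitors, matching the intended roles of the \u201c+\u201d and \u201c-\u201d relations.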
With more contextual relations added, the sampled configurations become more similar to a real scene, and the statistics of the learned distribution become closer to those of the target distribution.\n\n4 Inference with hierarchical cluster sampling\n\nWe design a hierarchical cluster sampling algorithm to infer the optimal parse tree under the SSG model. A parse tree specifies a configuration of visual entities. The combination of configurations makes the solution space expand exponentially, and it is NP-hard to enumerate all parse trees in such a large space.\n\n[Figure 4 panels: (i) initial distribution; (ii) with cooperative (+) relations; (iii) with competitive (-) relations; (iv) with both (+/-) relations]\n\nFigure 5: The hierarchical cluster sampling process.\n\nTo detect scene components, neither sliding-window (top-down) nor binding (bottom-up) approaches alone can handle the large geometric variations and the enormous number of configurations. In this paper we combine the bottom-up and top-down processes by exploring the contextual relations defined on the grammar model. The algorithm first performs a bottom-up clustering stage, followed by a top-down sampling stage.\nIn the clustering stage, we group visual entities into clusters (tight structures) by filtering the entities based on cooperative \u201c+\u201d relations. Starting from the low-level line segments illustrated in Fig.1(iv), we detect substructures, such as 2D faces, aligned and nested 2D faces, 3D boxes, and aligned and stacked 3D boxes (in Fig.3(a)), layer by layer. The clusters Cl are formed only if the cooperative \u201c+\u201d constraints are satisfied. The proposal probability for each cluster Cl is defined as\n\nP_+(Cl|I) = \u220f_{v\u2208Cl^OR} P^OR(A^T(v)) \u220f_{u,v\u2208Cl^AND} P^AND_+(A^G(u), A^G(v)) \u220f_{v\u2208Cl^T} P^T(I(\u039b_v)).   (3)\n\nClusters with marginal probabilities below a threshold are pruned. 
The threshold is learned from a probably approximately admissible (PAA) bound [23]. The clusters so defined are enumerable.\nIn the sampling stage, we perform an efficient MCMC inference to search the combinatorial space. In each step, the Markov chain jumps over a cluster (a large set of nodes) given the information of \u201cwhat goes together\u201d from clustering. The algorithm proposes a new parse tree pt\u2217 = pt + Cl\u2217 with the cluster Cl\u2217 conditioned on the current parse tree pt. To avoid heavy computation, the proposal probability is defined as\n\nQ(pt\u2217|pt, I) = P_+(Cl\u2217|I) \u220f_{u\u2208Cl^AND, v\u2208pt^AND} P^AND_\u2212(A^G(u)|A^G(v)).   (4)\n\nThe algorithm gives more weight to proposals with strong bottom-up support and tight \u201c+\u201d relations through P_+(Cl|I), and simultaneously avoids exclusive proposals with \u201c-\u201d relations through P^AND_\u2212(A^G(u)|A^G(v)). All of these probabilities are pre-computed before sampling: the marginal probability of each cluster P_+(Cl|I) is computed during the clustering stage, and the probability P^AND_\u2212(A^G(u)|A^G(v)) for each pair-wise negative \u201c-\u201d relation is then calculated and stored in a look-up table. The algorithm also proposes new parse trees by pruning the current parse tree randomly. By applying the Metropolis-Hastings acceptance probability \u03b1(pt \u2192 pt\u2217) = min{1, [Q(pt|pt\u2217, I) / Q(pt\u2217|pt, I)] \u00b7 [P(pt\u2217|I) / P(pt|I)]}, the Markov chain search satisfies the detailed balance principle and converges to the posterior distribution, approaching the most probable configuration as shown in Fig.5.\n\n5 Experiments\n\nWe evaluate our algorithm on both the UIUC indoor dataset [2] and our own dataset. The UIUC dataset contains 314 cluttered indoor images, for which the ground-truth consists of two label maps of the background layout with/without foreground objects. 
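The Metropolis-Hastings jump described in Sect.4 can be sketched as follows; `propose`, `energy` and `q` are hypothetical stand-ins for the cluster proposals of Eq.3-4, and only the acceptance logic follows the text:

```python
# Sketch of one reversible Metropolis-Hastings jump: accept pt* with probability
# min{1, [q(pt|pt*) / q(pt*|pt)] * [P(pt*|I) / P(pt|I)]}, where P ∝ exp(-E), so
# the posterior ratio becomes exp(E(pt) - E(pt*)) and we work in log space.
import math
import random

def mh_step(pt, energy, propose, q):
    """Propose pt* = propose(pt); accept or reject by the MH acceptance ratio."""
    pt_star = propose(pt)
    log_ratio = (energy(pt) - energy(pt_star)       # log P(pt*|I) - log P(pt|I)
                 + math.log(q(pt_star, pt))         # log Q(pt|pt*, I)
                 - math.log(q(pt, pt_star)))        # log Q(pt*|pt, I)
    accept = log_ratio >= 0 or random.random() < math.exp(log_ratio)
    return pt_star if accept else pt
```

With a symmetric proposal (`q` constant) the rule reduces to plain Metropolis: downhill-in-energy jumps are always accepted, while uphill jumps are accepted with probability exp(-ΔE).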
Our dataset contains 220 images which cover six indoor scene categories: bedroom, living room, kitchen, classroom, office room, and corridor. The dataset is available on the project webpage1. The ground-truths are hand-labeled segments of the scene components for each image. Our algorithm usually takes 20 s for clustering, 40 s for sampling, and 1 min for preparing the input features.\n\n[Figure 5 plot: energy vs. iterations (0-300), with intermediate parsing results shown at iterations 0, 50, 100, 150, 200, 250, and 300]\n\nFigure 6: Quantitative performance of 2D face detection (a) and 3D foreground detection (b) in our dataset. (c) An example of the top proposals and the result after inference.\n\nQualitative evaluation: The experimental results in Fig.7 are obtained by applying different production rules to images in our dataset. With the AND rules only, the algorithm obtains reasonable results and successfully recovers some salient 3D foreground objects and 2D faces. With both the AND and SET rules, the cooperative \u201c+\u201d relations help detect some weak visual entities. Fig.8 lists more experimental results on the UIUC dataset. The proposed algorithm recovers most of the indoor components. In the last row, we show some challenging images with missing detections and false positives. Weak line information, ambiguous overlapping objects, salient patterns and clustered structures can confuse our algorithm.\nQuantitative evaluation: We first evaluate the detection of 2D faces and 3D foreground objects in our dataset. The detection error is measured at the pixel level; it indicates how many pixels are correctly labelled. In Fig.6, the red curves show the ROC of 2D face / 3D object detection in the clustering stage. They are computed by thresholding the cluster probabilities given by Eq.3. The blue curves show the ROC of the final detection given a partial parse tree after MCMC inference. 
They are computed by thresholding the marginal probability given by Eq.2. Using the UIUC dataset, we compare our algorithm to four other state-of-the-art indoor scene parsing algorithms: Hoiem et al. [1], Hedau et al. [2], Wang et al. [3] and Lee et al. [4]. All four of these algorithms use discriminative learning with Structured SVM (or Latent SVM). By applying the production rules and the contextual relations, our generative grammar model outperforms the others, as shown in Table 1.\n\n6 Conclusion\n\nIn this paper, we propose a framework for geometric image parsing using a Stochastic Scene Grammar (SSG). The grammar model is used to represent the compositional structure of visual entities. It goes beyond traditional probabilistic context-free grammars (PCFGs) in a few aspects: spatial context, production rules for multiple occurrences of objects, and richer image appearance and geometric properties. We also design a hierarchical cluster sampling algorithm that uses contextual relations to accelerate the Markov chain search. The SSG model is flexible enough to model other compositional structures by applying different production rules and contextual relations. An interesting extension of our work would be adding semantic labels, such as chair, desk, shelf etc., to 3D objects. 
This will make it possible to discover new relations between objects, e.g. TV and sofa, desk and chair, bed and night table, as demonstrated in [26].\n\nAcknowledgments\n\nThe work is supported by grants from NSF IIS-1018751, NSF CNS-1028381 and ONR MURI N00014-10-1-0933.\n\n1http://www.stat.ucla.edu/\u02dcybzhao/research/sceneparsing\n\n[Figure 6 ROC plots: true positive rate vs. false negative rate for 2D face detection and 3D foreground detection, comparing cluster proposals with the results after inference]\n\nFigure 7: Experimental results obtained by applying the AND/OR rules (first row) and all AND/OR/SET rules (second row) in our dataset.\n\nFigure 8: Experimental results on more complex indoor images in the UIUC dataset [2]. The last row shows some challenging images with missing detections and false positives of the proposed algorithm.\n\nTable 1: Segmentation precision compared with Hoiem et al. 2007 [1], Hedau et al. 2009 [2], Wang et al. 2010 [3] and Lee et al. 2010 [4] on the UIUC dataset [2].\n\nSegmentation precision | [1] | [2] | [3] | [4] | Our method\nWithout rules | 73.5% | 78.8% | 79.9% | 81.4% | 80.5%\nWith 3D \u201c-\u201d constraints | - | - | - | 83.8% | 84.4%\nWith AND, OR rules | - | - | - | - | 85.1%\nWith AND, OR, SET rules | - | - | - | - | 85.5%\n\nReferences\n\n[1] Hoiem, D., Efros, A., & Hebert, M. (2007) Recovering Surface Layout from an Image. IJCV 75(1).\n[2] Hedau, V., Hoiem, D., & Forsyth, D. (2009) Recovering the Spatial Layout of Cluttered Rooms. In ICCV.\n[3] Wang, H., Gould, S., & Koller, D. (2010) Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding. In ECCV.\n[4] Lee, D., Gupta, A., Hebert, M., & Kanade, T. (2010) Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. In Advances in Neural Information Processing Systems.\n[5] Shotton, J., & Winn, J. 
(2007) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. IJCV.\n[6] Tu, Z., & Bai, X. (2009) Auto-context and Its Application to High-level Vision Tasks and 3D Brain Image Segmentation. PAMI.\n[7] Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, pp. 282-289.\n[8] Saxena, A., Sun, M., & Ng, A. (2008) Make3D: Learning 3D Scene Structure from a Single Image. PAMI.\n[9] Gupta, A., Efros, A., & Hebert, M. (2010) Blocks World Revisited: Image Understanding using Qualitative Geometry and Mechanics. In ECCV.\n[10] Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005) Large Margin Methods for Structured and Interdependent Output Variables. JMLR, Vol. 6, pp. 1453-1484.\n[11] Manning, C., & Schuetze, H. (1999) Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.\n[12] Chen, H., Xu, Z., Liu, Z., & Zhu, S. C. (2006) Composite Templates for Cloth Modeling and Sketching. In CVPR (1), pp. 943-950.\n[13] Jin, Y., & Geman, S. (2006) Context and Hierarchy in a Probabilistic Image Model. In CVPR (2), pp. 2145-2152.\n[14] Zhu, L., & Yuille, A. L. (2005) A Hierarchical Compositional System for Rapid Object Detection. In Advances in Neural Information Processing Systems.\n[15] Fidler, S., & Leonardis, A. (2007) Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts. In CVPR.\n[16] Zhu, S. C., & Mumford, D. (2006) A Stochastic Grammar of Images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259-362.\n[17] Johnson, M., Griffiths, T. L., & Goldwater, S. (2007) Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Advances in Neural Information Processing Systems. 
Cambridge, MA: MIT Press.\n[18] Han, F., & Zhu, S. C. (2009) Bottom-Up/Top-Down Image Parsing with Attribute Grammar. PAMI.\n[19] Porway, J., & Zhu, S. C. (2010) Hierarchical and Contextual Model for Aerial Image Understanding. Int'l Journal of Computer Vision, vol. 88, no. 2, pp. 254-283.\n[20] Porway, J., & Zhu, S. C. (2011) C4: Computing Multiple Solutions in Graphical Models by Cluster Sampling. PAMI, vol. 33, no. 9, pp. 1713-1727.\n[21] Lee, D., Hebert, M., & Kanade, T. (2009) Geometric Reasoning for Single Image Structure Recovery. In CVPR.\n[22] Hedau, V., Hoiem, D., & Forsyth, D. (2010) Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry. In ECCV.\n[23] Felzenszwalb, P. F. (2010) Cascade Object Detection with Deformable Part Models. In CVPR.\n[24] Pero, L. D., Guan, J., Brau, E., Schlecht, J., & Barnard, K. (2011) Sampling Bedrooms. In CVPR.\n[25] Zhu, S. C., Wu, Y., & Mumford, D. (1997) Minimax Entropy Principle and Its Application to Texture Modeling. Neural Computation 9(8): 1627-1660.\n[26] Yu, L. F., Yeung, S. K., Tang, C. K., Terzopoulos, D., Chan, T. F., & Osher, S. (2011) Make It Home: Automatic Optimization of Furniture Arrangement. ACM Transactions on Graphics 30(4): 86.\n", "award": [], "sourceid": 72, "authors": [{"given_name": "Yibiao", "family_name": "Zhao", "institution": null}, {"given_name": "Song-chun", "family_name": "Zhu", "institution": null}]}