{"title": "Learning Hybrid Models for Image Annotation with Partially Labeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 625, "page_last": 632, "abstract": "Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure.", "full_text": "Learning Hybrid Models for Image Annotation with Partially Labeled Data\n\nXuming He\nDepartment of Statistics\nUCLA\nhexm@stat.ucla.edu\n\nRichard S. Zemel\nDepartment of Computer Science\nUniversity of Toronto\nzemel@cs.toronto.edu\n\nAbstract\n\nExtensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure.\n\n1 Introduction\n\nImage annotation, or image labeling, in which the task is to label each pixel or region of an image with a class label, is becoming an increasingly popular problem in the machine learning and machine vision communities [7, 14]. 
State-of-the-art methods formulate image annotation as a structured prediction problem, and utilize methods such as Conditional Random Fields [8, 4], which output multiple values for each input item. These methods typically rely on fully labeled data for optimizing model parameters. It is widely acknowledged that consistently-labeled images are tedious and expensive to obtain, which limits the applicability of discriminative approaches. However, a large number of partially-labeled images, with only a subset of regions labeled in an image, or only captions for images, are available (e.g., [12]). Learning labeling models with such data would help improve segmentation performance and relax the labeled-data requirement of discriminative labeling methods.\n\nA wide range of learning methods have been developed for using partially-labeled image data. One approach adopts a discriminative formulation, and treats the unlabeled regions as missing data [16]. Others take a semi-supervised learning approach by viewing unlabeled image regions as unlabeled data. One class of these methods generalizes traditional semi-supervised learning to structured prediction tasks [1, 10]. However, the common assumption about the smoothness of the label distribution with respect to the input data may not be valid in image labeling, due to large intra-class variation of object appearance. Other semi-supervised methods adopt a hybrid approach, combining a generative model of the input data with a discriminative model for image labeling, in which the unlabeled data are used to regularize the learning of a discriminative model [6, 9]. Only relatively simple probabilistic models are considered in these approaches, without capturing the contextual information in images.\n\nOur approach, described in this paper, extends the hybrid modeling strategy by incorporating a more flexible generative model for image data. 
In particular, we introduce a set of latent variables that capture image feature patterns in a hidden feature space, which are used to facilitate the labeling task. First, we extend the Latent Dirichlet Allocation (LDA) model [3] to include not only input features but also label information, capturing co-occurrences within and between image feature patterns and object classes in the data set. Unlike other topic models in image modeling [11, 18], our model integrates a generative model of image appearance and a discriminative model of region labels. Second, the original LDA structure does not impose any spatial smoothness constraint on label prediction, yet incorporating such a spatial prior is important for scene segmentation. Previous approaches have introduced lateral connections between latent topic variables [17, 15]. However, this complicates model learning, and, as a latent representation of image data, the topic variables can in general be non-smooth over the image plane. In this paper, we model the spatial dependency of labels by two different structures: one introduces directed connections between each label variable and its neighboring topic variables, and the other incorporates lateral connections between label variables. We will investigate whether these structures effectively capture the spatial prior and lead to accurate label predictions.\n\nThe remainder of this paper is organized as follows. The next section presents the base model, and two different extensions to handle label spatial dependencies. Sections 3 and 4 define inference and learning procedures for these models. Section 5 describes experimental results, and in the final section we discuss the model limitations and future directions.\n\n2 Model description\n\nThe structured prediction problem in image labeling can be formulated as follows. Let an image x be represented as a set of subregions {x_i}_{i=1}^{N_x}. 
The aim is to assign each x_i a label l_i from a categorical set L. For instance, the subregions x_i can be image patches or pixels, and L consists of object classes. Denote the set of labels for x as l = {l_i}_{i=1}^{N_x}. A key issue in structured prediction concerns how to capture the interactions between labels in l given the input image.\n\nModel I. We first introduce our base model for capturing individual patterns in image appearance and label space. Assume each subregion x_i is represented by two features (a_i, t_i), in which a_i describes its appearance (including color, texture, etc.) in some appearance feature space A, and t_i is its position on the image plane T. Our method focuses on the joint distribution of labels and subregion appearances given positions, by modeling co-occurring patterns in the joint space of L × A. We achieve this by extending the latent Dirichlet allocation model to include both label and appearance.\n\nMore specifically, we assume each observation pair (a_i, l_i) in image x is generated from a mixture of K hidden 'topic' components shared across the whole dataset, given the position information t_i. Following the LDA notation, the mixture proportion is denoted as θ, which is image-specific and shares a common Dirichlet prior parameterized by α. Also, z_i is used as an indicator variable to specify from which hidden topic component the pair (a_i, l_i) is generated. In addition, we use a to denote the appearance feature vector of each image, z for the indicator vector, and t for the position vector. Our model defines a joint distribution of label variables l and appearance feature variables a given the position t as follows:\n\nP_b(l, a | t, α) = ∫_θ [ Π_i Σ_{z_i} P(l_i | a_i, t_i, z_i) P(a_i | z_i) P(z_i | θ) ] P(θ | α) dθ,    (1)\n\nwhere P(θ | α) is the Dirichlet distribution. 
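The generative process behind Equation 1 can be illustrated with a short numerical sketch. This is not the authors' code: the classifier P(l | a, t, z) is replaced by a random stochastic table as a stand-in for the paper's MLP, and all sizes are toy values.

```python
import numpy as np

# Illustrative sketch of the generative process in Equation 1:
#   theta ~ Dir(alpha); for each subregion i: z_i ~ Cat(theta),
#   a_i ~ Cat(beta[z_i]), and l_i from a classifier-style distribution
#   conditioned on (a_i, t_i, z_i). Names and sizes are toy assumptions.
rng = np.random.default_rng(0)

K, V, L, N = 5, 20, 3, 12                    # topics, visual words, classes, subregions
alpha = np.full(K, 0.5)                      # symmetric Dirichlet prior
beta = rng.dirichlet(np.ones(V), size=K)     # per-topic word distributions P(a | z)

theta = rng.dirichlet(alpha)                 # image-specific mixture proportion
t = rng.random((N, 2))                       # subregion positions (unused by this toy predictor)

z = rng.choice(K, size=N, p=theta)           # topic indicators
a = np.array([rng.choice(V, p=beta[k]) for k in z])   # visual words

# Toy label predictor P(l | a, t, z): one categorical per (topic, word) pair,
# standing in for the MLP classifier used in the paper.
label_probs = rng.dirichlet(np.ones(L), size=(K, V))
l = np.array([rng.choice(L, p=label_probs[z[i], a[i]]) for i in range(N)])
```

Summing l_i and a_i over z_i and integrating over θ, as in Equation 1, recovers the joint distribution this process samples from.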
We specify the appearance model P(a_i | z_i) to be position invariant, but the label predictor P(l_i | a_i, t_i, z_i) depends on the position information. These two components are formulated as follows, and the graphical representation of the model is shown in the left panel of Figure 1.\n\n(a) Label prediction module P(l_i | a_i, t_i, z_i). The label predictor P(l_i | a_i, t_i, z_i) is modeled by a probabilistic classifier that takes (a_i, t_i, z_i) as its input and produces a properly normalized distribution for l_i. Note that we represent z_i in its '0-1' vector form when it is used as the classifier input. So if the dimension of A is M, then the input dimension of the classifier is M + K + 2. We use an MLP with one hidden layer in our experiments, although other strong classifiers are also feasible.\n\n(b) Image appearance module P(a_i | z_i). We follow the convention of topic models and model the topic-conditional distributions of the image appearance using a multinomial distribution with parameters β_{z_i}. As the appearance features typically take on real values, we first apply k-means clustering to the image features {a_i} to build a visual vocabulary V. Thus a feature a_i in the appearance space A can be represented as a visual word v, and we have P(a_i = v | z_i = k) = β_{k,v}.\n\nWhile the topic prediction model in Equation 1 is able to capture regularly co-occurring patterns in the joint space of label and appearance, it ignores spatial priors on the label prediction.\n\nFigure 1: Left: A graphical representation of the base topic prediction model (Model I). Middle: Model II. Right: Model III. 
Circular nodes are random variables, and shaded nodes are observed. N is the number of image features in each image, and D denotes all the training data.\n\nHowever, spatial priors, such as spatial smoothness, are crucial to labeling tasks, as neighboring labels are usually strongly correlated. To incorporate spatial information, we extend our base model in two different ways as follows.\n\nModel II. We introduce a dependency between each label variable and its neighboring topic variables. In this model, each label value is predicted based on the summary information of topics within a neighborhood. More specifically, we change the label prediction model into the following form:\n\nP(l_i | a_i, t_i, z_{N(i)}) = P(l_i | a_i, t_i, Σ_{j∈N(i)} w_j z_j),    (2)\n\nwhere N(i) is a predefined neighborhood for site i, and w_j is the weight for the topic variable z_j. We set w_j ∝ exp(-|t_i - t_j|/σ^2), normalized so that Σ_{j∈N(i)} w_j = 1. The graphical representation is shown in the middle panel of Figure 1. This model variant can be viewed as an extension of supervised LDA [2]. Here, however, rather than a single label applying to each input example, there are multiple labels, one for each element of x.\n\nModel III. We add lateral connections between label variables to build a Conditional Random Field of labels. The joint label distribution given the input image is defined as\n\nP(l | a, t, α) = (1/Z) exp{ Σ_i Σ_{j∈N(i)} f(l_i, l_j) + γ Σ_i log P_b(l_i | a, t, α) },    (3)\n\nwhere Z is the partition function. The pairwise potential is f(l_i, l_j) = Σ_{a,b} u_{ab} δ_{l_i,a} δ_{l_j,b}, and the unary potential is defined as the log output of the base topic prediction model weighted by γ. Here δ is the Kronecker delta function. Note that P_b(l_i | a, t, α) = Σ_{z_i} P(l_i | a_i, t_i, z_i) P(z_i | a, t). 
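The Model II neighborhood weights w_j ∝ exp(-|t_i - t_j|/σ^2), normalized over N(i), can be sketched directly; the radius-2σ neighborhood follows the experimental setup described later in the paper, and the function name is illustrative.

```python
import numpy as np

# Sketch of the Model II neighborhood weights: w_j is proportional to
# exp(-|t_i - t_j| / sigma^2) and normalized to sum to 1 over N(i).
# N(i) is taken as the sites within radius 2*sigma of site i, as in
# the paper's experiments.
def neighborhood_weights(t, i, sigma=10.0):
    dist = np.linalg.norm(t - t[i], axis=1)     # |t_i - t_j| for all sites j
    nbr = np.flatnonzero(dist <= 2 * sigma)     # N(i), includes i itself
    w = np.exp(-dist[nbr] / sigma ** 2)
    return nbr, w / w.sum()                     # normalized weights

t = np.random.default_rng(2).random((50, 2)) * 100   # toy site positions (pixels)
nbr, w = neighborhood_weights(t, i=0)
# The classifier input for site i is then the weighted topic summary
# sum_j w_j * z_j (or w_j * q(z_j) at inference time).
```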
This model is shown in the right panel of Figure 1.\n\nNote that the base model (Model I) obtains spatially smooth labels simply through the topics capturing location-dependent co-occurring appearance/label patterns, which tend to be nearby in image space. Model II explicitly predicts a region's label from the topics in its local neighborhood, so that neighboring labels share similar contexts defined by the latent topics. In both of these models, the interaction between labels takes effect through the hidden input representation. The third model uses a conventional form of spatial dependency by directly incorporating local smoothing in the label field. While this structure may impose a stronger spatial prior than the other two, it also requires more complicated learning methods.\n\n3 Inference and Label Prediction\n\nGiven a new image x = {a, t} and our topic models, we predict its labeling based on the Maximum Posterior Marginals (MPM) criterion:\n\nl_i^* = argmax_{l_i} P(l_i | a, t).    (4)\n\nWe consider the label inference procedure for the three models separately as follows.\n\nModels I&II: The marginal label distribution P(l_i | a, t) can be computed as:\n\nP(l_i | a, t) = Σ_{z_{N(i)}} P(l_i | a_i, t_i, Σ_{j∈N(i)} w_j z_j) P(z_{N(i)} | a, t).    (5)\n\nThe summation here is difficult when N(i) is large. However, it can be approximated as follows. Denote v_i = Σ_{j∈N(i)} w_j z_j and v_{i,q} = Σ_{j∈N(i)} w_j q(z_j), where q(z_j) = {P(z_j | a, t)} is the vector form of the posterior distribution. Both v_i and v_{i,q} are in [0, 1]^K. The marginal label distribution can be written as P(l_i | a, t) = <P(l_i | a_i, t_i, v_i)>_{P(z_{N(i)} | a, t)}. We take the first-order approximation of P(l_i | a_i, t_i, v_i) around v_{i,q} using a Taylor expansion:\n\nP(l_i | a_i, t_i, v_i) ≈ P(l_i | a_i, t_i, v_{i,q}) + (v_i - v_{i,q})^T · ∇_{v_i} P(l_i | a_i, t_i, v_i)|_{v_{i,q}}.    (6)\n\nTaking expectation on both sides of Equation 6 w.r.t. 
P(z_{N(i)} | a, t) (noticing that <v_i>_{P(z_{N(i)} | a, t)} = v_{i,q}, so the gradient term vanishes), we have the following approximation: P(l_i | a, t) ≈ P(l_i | a_i, t_i, Σ_{j∈N(i)} w_j q(z_j)).\n\nModel III: We first compute the unary potential of the CRF model from the base topic prediction model, i.e., P_b(l_i | a, t) = Σ_{z_i} P(l_i | a_i, t_i, z_i) P(z_i | a, t). Then the label marginals in Equation 4 are computed by applying loopy belief propagation to the conditional random field.\n\nIn both situations, we need the conditional distribution of the hidden topic variables z given the observed data components to compute the label prediction. We take a Gibbs sampling approach, integrating out the Dirichlet variable θ. From Equation 1, we can derive the posterior of each topic variable z_i given the other variables, which is required by Gibbs sampling:\n\nP(z_i = k | z_{-i}, a_i) ∝ P(a_i | z_i)(α_k + Σ_{m∈S\\i} δ_{z_m,k}),    (7)\n\nwhere z_{-i} denotes all the topic variables in z except z_i, and S is the set of all sites. Given the samples of the topic variables, we estimate their posterior marginal distribution P(z_i | a, t) by simply computing their normalized histograms.\n\n4 Learning with partially labeled data\n\nHere we consider estimating the parameters of both extended models from a partially labeled image set D = {x^n, l^n}. For an image x^n, its label set is l^n = (l^n_o, l^n_h), in which l^n_o denotes the observed labels and l^n_h the missing ones. We also use o to denote the set of labeled regions. As the three models are built with different components, we treat them separately.\n\nModels I&II. We use the Maximum Likelihood criterion to estimate the model parameters. Let Θ be the parameter set of the model,\n\nΘ* = argmax_Θ Σ_n log P(l^n_o, a^n | t^n; Θ).    (8)\n\nWe maximize the log data likelihood by Monte Carlo EM. 
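The collapsed Gibbs update of Equation 7 can be sketched as follows. This is a minimal toy version (labels omitted, one image), assuming fixed β; the function name and sizes are illustrative.

```python
import numpy as np

# Sketch of the collapsed Gibbs update in Equation 7: with theta integrated
# out, P(z_i = k | z_{-i}, a_i) is proportional to
#   P(a_i | z_i = k) * (alpha_k + n_{-i,k}),
# where n_{-i,k} counts how often topic k is used at the other sites.
def gibbs_sweep(z, a, beta, alpha, rng):
    K = beta.shape[0]
    counts = np.bincount(z, minlength=K).astype(float)
    for i in range(len(z)):
        counts[z[i]] -= 1.0                      # remove site i from the counts
        p = beta[:, a[i]] * (alpha + counts)     # Equation 7, unnormalized
        z[i] = rng.choice(K, p=p / p.sum())
        counts[z[i]] += 1.0
    return z

rng = np.random.default_rng(4)
K, V, N = 4, 15, 30
beta = rng.dirichlet(np.ones(V), size=K)   # fixed appearance model P(a | z)
alpha = np.full(K, 0.5)
a = rng.choice(V, size=N)                  # observed visual words
z = rng.choice(K, size=N)                  # random initialization
for _ in range(20):                        # burn-in sweeps
    z = gibbs_sweep(z, a, beta, alpha, rng)
# Posterior marginals P(z_i | a, t) are then estimated from normalized
# histograms of the retained samples, as described above.
```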
The lower bound of the likelihood can be written as\n\nQ = Σ_n < Σ_{i∈o} log P(l^n_i | a^n_i, t^n_i, z^n_{N(i)}) + Σ_i log P(a^n_i | z^n_i) + log P(z^n) >_{P(z^n | l^n_o, a^n)}.    (9)\n\nIn the E step, the posterior distributions of the topic variables are estimated by a Gibbs sampling procedure similar to Equation 7. It uses the following conditional probability:\n\nP(z_i = k | z_{-i}, a_i, l, t) ∝ Π_{j∈N(i)∩o} P(l_j | a_j, t_j, z_{N(j)}) · P(a_i | z_i)(α_k + Σ_{m∈S\\i} δ_{z_m,k}).    (10)\n\nNote that any label variable is marginalized out if it is missing. In the M step, we update the model parameters by maximizing the lower bound Q. Denoting the posterior distribution of z as q(·), the updating equation for the parameters of the appearance module P(a|z) can be derived from the stationary point of Q:\n\nβ*_{k,v} ∝ Σ_{n,i} q(z^n_i = k) δ(a^n_i, v).    (11)\n\nThe classifier in the label prediction module is learned by maximizing the following log likelihood:\n\nL_c = Σ_{n,i∈o} < log P(l^n_i | a^n_i, t^n_i, Σ_{j∈N(i)} w_j z_j) >_{q(z_{N(i)})} ≈ Σ_{n,i∈o} log P(l^n_i | a^n_i, t^n_i, Σ_{j∈N(i)} w_j q(z_j)),    (12)\n\nwhere the approximation takes the same form as in Equation 6. We use a gradient ascent algorithm to update the classifier parameters. Note that we need to run only a few iterations at each M step, which reduces training time.\n\nModel III. We estimate the parameters of Model III in two stages: (1) The parameters of the base topic prediction model are learned using the same procedure as in Models I&II. More specifically, we set N(i) = {i} and estimate the parameters of the appearance module and label classifier based on Maximum Likelihood. (2) Given the base topic prediction model, we compute the marginal label probability P_b(l_i | a, t) and plug it into the unary potential function of the CRF model (see Equation 3). 
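The M-step update of Equation 11 amounts to accumulating expected topic-word counts under the E-step posteriors and normalizing per topic. A minimal sketch, with illustrative names:

```python
import numpy as np

# Sketch of the M-step update in Equation 11: beta[k, v] is proportional to
# the expected number of times word v is assigned to topic k under the
# E-step posterior marginals q(z_i). q is a (num_sites, K) matrix.
def update_beta(q, a, V):
    K = q.shape[1]
    beta = np.zeros((K, V))
    for i, v in enumerate(a):
        beta[:, v] += q[i]      # accumulate q(z_i = k) * delta(a_i, v)
    return beta / beta.sum(axis=1, keepdims=True)   # normalize each topic row

rng = np.random.default_rng(5)
K, V, N = 4, 15, 100
q = rng.dirichlet(np.ones(K), size=N)   # toy posterior marginals q(z_i)
a = rng.choice(V, size=N)               # observed visual words
beta = update_beta(q, a, V)
```

In practice a small smoothing count would be added before normalizing to avoid zero probabilities for unseen words; it is omitted here for clarity.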
We then estimate the parameters of the CRF by maximizing the conditional pseudo-likelihood:\n\nL_p = Σ_n Σ_{i∈o} ( Σ_{j∈N(i)} Σ_{a,b} u_{ab} δ_{l^n_i,a} δ_{l^n_j,b} + γ log P_b(l^n_i | a^n, t^n) - log Z^n_i ),    (13)\n\nwhere Z^n_i = Σ_{l_i} exp{ Σ_{j∈N(i)} Σ_{a,b} u_{ab} δ_{l_i,a} δ_{l^n_j,b} + γ log P_b(l_i | a^n, t^n) } is the normalizing constant. As this cost function is concave in the parameters, we use a simple gradient ascent method to optimize the conditional pseudo-likelihood.\n\n5 Experimental evaluation\n\nData sets and representation. Our experiments are based on three image datasets. The first is a subset of the Microsoft Research Cambridge (MSRC) Image Database [14], as in [16]. This subset includes 240 images and 9 different label classes. The second set is the full MSRC image dataset, including 591 images and 21 object classes. The third set is a labeled subset of the Corel database, as in [5] (referred to therein as Corel-B). It includes 305 manually labeled images with 11 classes, focusing on animals and natural scenes.\n\nWe use the normalized cut segmentation algorithm [13] to build a super-pixel representation of the images, in which the segmentation algorithm is tuned to generate approximately 1000 segments for each image on average. We extract a set of basic image features, including color, edge and texture information, from each pixel site. For the color information, we transform the RGB values into CIE Lab* color space. The edge and texture information is extracted by a set of filter banks, including a difference-of-Gaussian filter at 3 different scales, and quadrature pairs of oriented even- and odd-symmetric filters at 4 orientations and 3 scales. The color descriptor of a super-pixel is the average color over the pixels in that super-pixel. For the edge and texture descriptors, we first discretize the edge/texture feature space by k-means, and use each cluster as a bin. 
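The pseudo-likelihood objective of Equation 13 can be sketched on a toy label field. This is not the paper's implementation: the neighborhood is a simple chain and the unaries are synthetic, standing in for log P_b from the base model.

```python
import numpy as np

# Sketch of the conditional pseudo-likelihood of Equation 13: for each
# observed site i, score its true label against its neighbors' true labels
# plus the weighted log unary, minus the local log normalizer log Z_i.
def pseudo_loglik(labels, nbrs, log_unary, u, gamma):
    total = 0.0
    for i, nb in enumerate(nbrs):
        # site-conditional scores for every candidate label at site i
        scores = gamma * log_unary[i] + sum(u[:, labels[j]] for j in nb)
        logZ = np.log(np.exp(scores).sum())      # local normalizer Z_i
        total += scores[labels[i]] - logZ
    return total

rng = np.random.default_rng(6)
L, N = 3, 6
labels = rng.choice(L, size=N)                       # observed labels l^n
nbrs = [[j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)]  # chain N(i)
log_unary = np.log(rng.dirichlet(np.ones(L), size=N))  # toy log P_b(l_i | a, t)
u = rng.normal(size=(L, L))                          # pairwise potentials u_ab
lp = pseudo_loglik(labels, nbrs, log_unary, u, gamma=1.0)
```

Gradient ascent on `u` and `gamma` with this objective mirrors the two-stage training described above.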
Then we compute the normalized histograms of the features within a super-pixel as the edge/texture descriptor. In the experiments reported here, we used 20 bins for edge information and 50 bins for texture information. We also augment each feature by a SIFT descriptor extracted from a 30 × 30 image patch centered at the super-pixel. The image position of a super-pixel is the average position of its pixels. To compute the vocabulary of visual words in the topic model, we apply k-means to group the super-pixel descriptors into clusters. The cluster centers are used as visual words, and each descriptor is encoded by its word index.\n\nComparison methods. We compare our approach directly with two baseline systems: a super-pixel-wise classifier and a basic CRF model. We also report the experimental results from [16], although they adopt a different data representation in their experiments (patches rather than super-pixels). The super-pixel-wise classifier is an MLP with one hidden layer, which predicts labels for each super-pixel independently. The MLP has 30 hidden units, a number chosen based on validation performance. In the basic CRF, the conditional distribution of the labels of an image is defined as:\n\nP(l | a, t) ∝ exp{ Σ_{i,j} Σ_{u,v} σ_{u,v} δ_{l_i,u} δ_{l_j,v} + γ Σ_i h(l_i | a_i, t_i) },    (14)\n\nwhere h(·) is the log output of the super-pixel classifier. We train the CRF model by maximizing its conditional pseudo-likelihood, and label the image based on the marginal distribution of each label variable, computed by the loopy belief propagation algorithm.\n\nPerformance on MSRC-9. Following the setting in [16], we randomly split the dataset into training and testing sets of equal size, and use 10% of the training data as our validation set. 
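The visual-word vocabulary step above (k-means over descriptors, then encoding each descriptor by its nearest center's index) can be sketched with a plain Lloyd's-algorithm implementation; the function name and toy data are illustrative, not from the paper.

```python
import numpy as np

# Minimal sketch of building the visual vocabulary: quantize real-valued
# super-pixel descriptors into discrete visual words with k-means (Lloyd's
# algorithm). The word index then serves as a_i, so that the appearance
# module is a simple multinomial P(a_i = v | z_i = k) = beta[k, v].
def build_vocabulary(feats, n_words, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), n_words, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest center
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        words = d.argmin(1)
        # recompute centers (keep the old center if a cluster went empty)
        for v in range(n_words):
            if np.any(words == v):
                centers[v] = feats[words == v].mean(0)
    return centers, words

feats = np.random.default_rng(1).random((200, 8))   # toy super-pixel descriptors
centers, words = build_vocabulary(feats, n_words=10)
```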
In this experiment, we set the vocabulary size to 500, the number of hidden topics to 50, and each symmetric Dirichlet parameter α_k = 0.5, based on validation performance. For Model II, we define the neighborhood of each site i as the subset of sites that falls into a circular region centered at i with radius 2σ, where σ is the fall-off rate of the weights. We set σ to be 10 pixels, which is roughly 1/20 of the image size. The classifiers for label prediction have 15 hidden units. The appearance model for topics and the classifier are initialized randomly. In the learning procedure, the E step uses 500 samples to estimate the posterior distribution of topics. In the M step, we take 3 steps of gradient ascent learning of the classifiers per iteration.\n\nThe performance of our models is first evaluated on the dataset with all the labels available. We compare the performance of the three model variants to the super-pixel classifier (S Class), and the CRF model. Table 1 shows the average classification accuracy rates of our model and the baselines for each class and in total, over 10 different random partitions of the dataset.\n\nTable 1: A comparison of classification accuracy of the 3 variants of our model with other methods. The average classification accuracy is at the pixel level.\n\nLabel     | building | grass | tree | cow  | sky  | plane | face | car  | bike | Total\nS Class   | 61.2     | 93.2  | 71.3 | 57.0 | 92.9 | 37.5  | 69.0 | 56.0 | 54.1 | 74.2\nCRF       | 69.8     | 94.4  | 82.1 | 73.3 | 94.2 | 62.0  | 80.5 | 80.1 | 78.6 | 83.5\nModel I   | 64.8     | 93.0  | 76.6 | 72.0 | 93.5 | 65.1  | 74.4 | 61.3 | 77.7 | 79.7\nModel II  | 79.2     | 94.1  | 81.4 | 80.2 | 93.5 | 72.4  | 86.3 | 69.5 | 86.2 | 85.5\nModel III | 78.1     | 92.5  | 85.4 | 86.7 | 94.6 | 77.9  | 83.5 | 74.7 | 88.3 | 86.7\n[16]      | 73.6     | 91.1  | 82.1 | 73.6 | 95.7 | 78.3  | 89.5 | 84.5 | 81.4 | 84.9\n\n
We can see that Model I, which uses latent feature representations as additional inputs, achieves much better performance than the S Class baseline. Model II and Model III improve the accuracy further by incorporating the label spatial priors. We notice that the lateral connections between label variables are more effective than integrating information from neighboring latent topic variables. This is also demonstrated by the good performance of the simple CRF.\n\nLearning with different amounts of label data. In order to test the robustness of the latent feature representation, we evaluate our models using data with different amounts of labeling information. We use an image dilation operator on the image regions labeled as 'void', and control the proportion of labeled data by varying the diameter of the dilation operator (see [16] for similar processing). Specifically, we use diameter values of 5, 10, 15, 20, 25, 30 and 35 to change the proportion of labeled pixels to 62.9%, 52.1%, 44.1%, 36.4%, 30.5%, 24.9% and 20.3%, respectively. The original proportion is 71.9%. We report the average accuracies of 5 runs of training and testing with random equal partitions of the dataset in Figure 2. The figure shows that the performance of all three models degrades with fewer labeled data, but the degradation is relatively gradual. When the proportion of labeled data decreases from 72% to 20%, the total loss in accuracy is less than 10%. This suggests that incorporating latent features makes our models more robust against missing labels than the previous work (cf. [16]). 
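The dilation-based ablation above can be sketched as follows. The paper applies a disk-shaped dilation to the 'void' regions; for a self-contained sketch, a square structuring element built from array shifts stands in for it, and the mask is synthetic.

```python
import numpy as np

# Sketch of the labeled-data ablation: dilate the 'void' (unlabeled) mask so
# that a growing band around unlabeled regions loses its labels, shrinking
# the labeled proportion as in Figure 2. Dilation here uses simple wrapped
# array shifts (a square structuring element) as a stand-in for the paper's
# disk-shaped operator.
def dilate(mask, radius):
    out = mask.copy()
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

rng = np.random.default_rng(8)
void = rng.random((64, 64)) < 0.1          # toy unlabeled-pixel mask
frac_before = 1.0 - void.mean()            # labeled proportion before dilation
void_dilated = dilate(void, radius=2)
frac_after = 1.0 - void_dilated.mean()     # labeled proportion after dilation
```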
We also note that the performance of Model III is more robust than that of the other two variants, which may derive from its stronger smoothing.\n\nTable 2: A comparison of classification accuracy of our three model variants with other methods on the full MSRC dataset and the Corel-B dataset.\n\n        | S Class | Model I | Model II | Model III | [14] | [5]\nMSRC    | 60.0    | 65.9    | 72.3     | 74.0      | 72.2 | -\nCorel-B | 68.2    | 69.2    | 73.4     | 75.5      | -    | 75.3\n\nFigure 2: Left: Classification accuracy with gradually decreasing proportion of labeled pixels. Right top: Examples of an image and its super-pixelization. Right bottom: Examples of original labeling and labeling after dilation (the labeled ratio is 36.4%).\n\nPerformance on other sets. We further evaluate our models on two larger datasets to see whether they can scale up. The first dataset is the full version of the MSRC dataset, and we use the same training/testing partition as in [14]. The model setting is the same as for MSRC-9, except that we use an MLP with 20 hidden units for label prediction. The second is the Corel-B dataset, which is divided randomly into 175 training images and 130 testing images. We use the same setting of the models as in the experiments on the full MSRC set. Table 2 summarizes the classification accuracies of our models as well as some previous methods. For the full MSRC set, the two extended versions of our model achieve performance similar to [14], and we can see that the latent topic representation provides useful cues. Also, our models have the same accuracy as reported in [5] on the Corel-B dataset, while we have a simpler label random field and use a smaller training set. 
It is interesting to note that the topics and spatial smoothness play a smaller role in the labeling performance on Corel-B. Figure 3 shows some examples of labeling results from both datasets. We can see that our models handle extended regions better than fine object structures, due to the tendency toward (over)smoothing caused by super-pixelization and the two spatial dependency structures.\n\n6 Discussion\n\nIn this paper, we presented a hybrid framework for image labeling, which combines a generative topic model with discriminative label prediction models. The generative model extends latent Dirichlet allocation to capture joint patterns in the label and appearance space of images. This latent representation of an image then provides an additional input to the label predictor. We also incorporated spatial dependency into the model structure in two different ways, both imposing a prior of spatial smoothness for labeling on the image plane. The results of applying our methods to three different image datasets suggest that this integrated approach may extend to a variety of image databases with only partial labeling available. The labeling system consistently outperforms alternative approaches, such as a standard classifier and a standard CRF. Its performance also matches that of state-of-the-art approaches, and is robust against different amounts of missing labels.\n\nSeveral avenues exist for future work. First, we would like to understand when the simple first-order approximation in inference for Model II holds, and when it breaks down, e.g., when the local curvature of the classifier with respect to its input is large. In addition, it is important to address model selection issues, such as the number of topics. We currently rely on the validation set, but more principled approaches are possible. A final issue concerns the reliance on visual words formed by clustering features in a complicated appearance space. 
Using a stronger appearance model may help us understand the role of different visual cues, as well as construct a more powerful generative model.\n\nFigure 3: Some labeling results for the Corel-B (bottom panel) and MSRC-9 (top panel) datasets, based on the best performance of our models; each example shows the original image, our model's labeling, and the ground truth. The MSRC-9 legend covers Building, Grass, Tree, Cow, Sky, Plane, Face, Car and Bike; the Corel-B legend covers Hippo/Rhino, Horse, Tiger, Polar Bear, Wolf/Leopard, Water, Vegetation, Sky, Ground, Snow and Fence. The 'Void' region is annotated in black.\n\nReferences\n\n[1] Yasemin Altun, David McAllester, and Mikhail Belkin. Maximum margin semi-supervised learning for structured variables. In NIPS 18, 2006.\n\n[2] David Blei and Jon McAuliffe. Supervised topic models. In NIPS 20, 2008.\n\n[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.\n\n[4] Xuming He, Richard Zemel, and Miguel Carreira-Perpinan. Multiscale conditional random fields for image labelling. In CVPR, 2004.\n\n[5] Xuming He, Richard S. Zemel, and Debajyoti Ray. Learning and incorporating top-down cues in image segmentation. In ECCV, 2006.\n\n[6] Michael Kelm, Chris Pal, and Andrew McCallum. Combining generative and discriminative methods for pixel classification with multi-conditional learning. In ICPR, 2006.\n\n[7] Sanjiv Kumar and Martial Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In ICCV, 2003.\n\n[8] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.\n\n[9] Julia A. 
Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In CVPR, 2006.\n\n[10] Chi-Hoon Lee, Shaojun Wang, Feng Jiao, Dale Schuurmans, and Russell Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. In NIPS 19, 2007.\n\n[11] Nicolas Loeff, Himanshu Arora, Alexander Sorokin, and David Forsyth. Efficient unsupervised learning for localization and detection in object categories. In NIPS, 2006.\n\n[12] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. Technical report, MIT AI Lab Memo AIM-2005-025, 2005.\n\n[13] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 2000.\n\n[14] Jamie Shotton, John M. Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.\n\n[15] Jakob Verbeek and Bill Triggs. Region classification with Markov field aspect models. In CVPR, 2007.\n\n[16] Jakob Verbeek and Bill Triggs. Scene segmentation with CRFs learned from partially labeled images. In NIPS 20, 2008.\n\n[17] Gang Wang, Ye Zhang, and Li Fei-Fei. Using dependent regions for object categorization in a generative framework. In CVPR, 2006.\n\n[18] Xiaogang Wang and Eric Grimson. Spatial latent Dirichlet allocation. In NIPS, 2008.", "award": [], "sourceid": 693, "authors": [{"given_name": "Xuming", "family_name": "He", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}]}