{"title": "Portmanteau Vocabularies for Multi-Cue Image Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 1323, "page_last": 1331, "abstract": "We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use Information theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation", "full_text": "Portmanteau Vocabularies for Multi-Cue Image\n\nRepresentation\n\nFahad Shahbaz Khan1, Joost van de Weijer1, Andrew D. Bagdanov1,2, Maria Vanrell1\n\n1Centre de Visio per Computador, Computer Science Department\n\n1Universitat Autonoma de Barcelona, Edifci O, Campus UAB (Bellaterra), Barcelona, Spain\n\n2 Media Integration and Communication Center, University of Florence, Italy\n\nAbstract\n\nWe describe a novel technique for feature combination in the bag-of-words model\nof image classi\ufb01cation. Our approach builds discriminative compound words from\nprimitive cues learned independently from training images. 
Our main observation\nis that modeling joint-cue distributions independently is more statistically robust\nfor typical classification problems than attempting to empirically estimate the\ndependent, joint-cue distribution directly. We use information-theoretic vocabulary\ncompression to find discriminative combinations of cues and the resulting\nvocabulary of portmanteau1 words is compact, has the cue binding property, and\nsupports individual weighting of cues in the final image representation. State-of-the-art\nresults on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets\ndemonstrate the effectiveness of our technique compared to other, significantly\nmore complex approaches to multi-cue image representation.\n\n1 Introduction\n\nImage categorization is the task of classifying an image as containing an object from a predefined\nlist of categories. One of the most successful approaches to this problem is the bag-of-words (BOW)\nmodel [4, 15, 11, 2]. In the bag-of-words model an image is first represented by a collection of local image\nfeatures detected either sparsely or in a regular, dense grid. Each local feature is then represented\nby one or more cues, each describing one aspect of a small region around the corresponding feature.\nTypical local cues include color, shape, and texture. These cues are then quantized into visual words\nand the final image representation is a histogram over these visual vocabularies. In the final stage of\nthe BOW approach the histogram representations are sent to a classifier.\nThe success of BOW is highly dependent on the quality of the visual vocabulary. In this paper we\ninvestigate visual vocabularies which are used to represent images whose local features are described\nby both shape and color. To extend BOW to multiple cues, two properties are especially important:\ncue binding and cue weighting. 
A visual vocabulary is said to have the binding property when two\nindependent cues appearing at the same location in an image remain coupled in the final image\nrepresentation. For example, if every local patch in an image is independently described by a shape\nword and a color word, in the final image representation using compound words the binding property\nensures that shape and color words coming from the same feature location are coupled in the final\nrepresentation. The term binding is borrowed from the neuroscience field where it is used to describe\nthe way in which humans select and integrate the separate cues of objects in the correct combinations\nin order to accurately recognize them [17]. The property of cue weighting implies that it is possible\nto adapt the relevance of each cue depending on the dataset. The importance of cue weighting can\nbe seen from the success of Multiple Kernel Learning (MKL) techniques where weights for each\ncue are automatically learned [3, 13, 21, 14, 1, 20].\n\n1A portmanteau is a combination of two or more words to form a neologism that communicates a concept\nbetter than any individual word (e.g. Ski resort + Konference = Skonference). We use the term to describe our\nvocabularies to emphasize the connotation with combining color and shape words into new, more meaningful\nrepresentations.\n\nTraditionally, two approaches exist to handle multiple cues in BOW. When each cue has its own\nvisual vocabulary the result is known as a late fusion image representation in which an image is\nrepresented as one histogram over shape-words and another histogram over color-words. Such a\nrepresentation does not have the cue binding property, meaning that it is impossible to know exactly\nwhich color-shape events co-occurred at local features. Late fusion does, however, allow cue\nweighting. Another approach, called early fusion, constructs a single visual vocabulary of joint color-shape\nwords. 
Representations over early fusion vocabularies have the cue binding property, meaning that\nthe spatial co-occurrence of shape and color events is preserved. However, cue weighting in early\nfusion vocabularies is very cumbersome since it must be performed before vocabulary construction,\nmaking cross-validation very expensive. Recently, Khan et al. [10] proposed a method which\ncombines cue binding and weighting. However, their final image representation size is equal to the number\nof vocabulary words times the number of classes, and is therefore not feasible for the large data sets\nconsidered in this paper.\nA straightforward, if combinatorially inconvenient, approach to ensuring the binding property is to\ncreate a new vocabulary that contains one word for each combination of original shape and color\nfeature. Considering that each of the original shape and color vocabularies may contain thousands of\nwords, the resulting joint vocabulary may contain millions. Such large vocabularies are impractical\nas estimating joint color-shape statistics is often infeasible due to the difficulty of sampling from\nlimited training data. Furthermore, with so many parameters the resulting classifiers are prone to\noverfitting. Because of this and other problems, this type of joint feature representation has not been\nfurther pursued as a way of ensuring that image representations have the binding property.\nIn recent years a number of vocabulary compression techniques have appeared that derive small,\ndiscriminative vocabularies from very large ones [16, 7, 5]. Most of these techniques are based on\ninformation theoretic clustering algorithms that attempt to combine words that are equivalently\ndiscriminative for the set of object categories being considered. Because these techniques are guided by\nthe discriminative power of clusters of visual words, estimates of class-conditional visual word\nprobabilities are essential. 
These recent developments in vocabulary compression allow us to reconsider\nthe direct, Cartesian product approach to building compound vocabularies.\nThese vocabulary compression techniques have been demonstrated on single-cue vocabularies with\na few tens of thousands of words. Starting from even moderately sized shape and color vocabularies\nresults in a compound shape-color vocabulary an order of magnitude larger. In such cases, robust\nestimates of the underlying class-conditional joint-cue distributions may be dif\ufb01cult to obtain. We\nshow that for typical datasets a strong independence assumption about the joint color-shape distri-\nbution leads to more robust estimates of the class-conditional distributions needed for vocabulary\ncompression. In addition, our estimation technique allows \ufb02exible cue-speci\ufb01c weighting that can-\nnot be easily performed with other cue combination techniques that maintain the binding property.\n\n2 Portmanteau vocabularies\n\nIn this section we propose a new multi-cue vocabulary construction method that results in com-\npact vocabularies which possess both the cue binding and the cue weighting properties described\nabove. Our approach is to build portmanteau vocabularies of discriminative, compound shape and\ncolor words chosen from independently learned color and shape lexicons. The term portmanteau\nis used in natural language for words which are a blend of two other words and which combine\ntheir meaning. We use the term portmanteau to describe these compound terms to emphasize the\nfact that, similarly to the use of neologistic portmanteaux in natural language to capture complex\nand compound concepts, we create groups of color and shape words to describe semantic concepts\ninadequately described by shape or color alone.\nA simple way to ensure the binding property is by considering a product vocabulary that contains\na new word for every combination of shape and color terms. 
Assume that S = {s1, s2, ..., sM}\nand C = {c1, c2, ..., cN} represent the visual shape and color vocabularies, respectively. Then the\nproduct vocabulary is given by\n\nW = {w1, w2, ..., wT} = {{si, cj} | 1 \u2264 i \u2264 M, 1 \u2264 j \u2264 N},\n\nwhere T = M \u00d7 N. We will also use the notation sm to identify a member from the set S.\n\nFigure 1: Comparison of two estimates of the joint cue distribution p(S, C|R) on two large datasets.\nThe graphs plot the Jensen-Shannon divergence between each estimate and the true joint distribution\nas a function of the number of training images used to estimate them. The true joint distribution is\nestimated empirically over all images in each dataset. Estimation using the independence\nassumption of equation (2) yields similar or better estimates than direct empirical estimation.\n\nA disadvantage of vocabularies of compound terms constructed by considering the Cartesian product\nof all primitive shape and color words is that the total number of visual words is equal to the number\nof color words times the number of shape words, which typically results in hundreds of thousands of\nelements in the final vocabulary. This is impractical for two reasons. First, the high dimensionality\nof the representation hampers the use of complex classifiers such as SVMs. Second, insufficient\ntraining data often renders robust estimation of parameters very difficult and the resulting classifiers\ntend to overfit the training set. Because of these drawbacks, compound product vocabularies have,\nto the best of our knowledge, not been pursued in the literature. In the next two subsections we discuss\nour approach to overcoming these two drawbacks.\n\n2.1 Compact Portmanteau Vocabularies\n\nIn recent years, several algorithms for feature clustering have been proposed which compress large\nvocabularies into small ones [16, 7, 5]. 
To reduce the high dimensionality of the product vocabulary,\nwe apply the Divisive Information-Theoretic feature Clustering (DITC) algorithm [5], which was shown\nto outperform AIB [16]. Furthermore, DITC has also been successfully employed to construct\ncompact pyramid representations [6].\nThe DITC algorithm is designed to find a fixed number of clusters which minimize the loss in\nmutual information between clusters and the class labels of training samples. In our algorithm, loss\nin mutual information is measured between the original product vocabulary and the resulting clusters.\nThe algorithm joins words which have similar discriminative power over the set of classes in the\nimage categorization problem. This is measured by the probability distributions p(R|wt), where\nR = {r1, r2, ..., rL} is the set of L classes.\nMore precisely, the drop in mutual information I between the vocabulary W and the class labels\nR when going from the original set of vocabulary words W to the clustered representation W^R =\n{W1, W2, ..., WJ} (where every Wj represents a cluster of words from W) is equal to\n\n$$I(R; W) - I(R; W^R) = \\sum_{j=1}^{J} \\sum_{w_t \\in W_j} p(w_t)\\, \\mathrm{KL}\\big(p(R|w_t) \\,\\|\\, p(R|W_j)\\big) \\qquad (1)$$\n\nwhere KL is the Kullback-Leibler divergence between two distributions. Equation (1) states that the\ndrop in mutual information is equal to the prior-weighted KL-divergence between a word and its\nassigned cluster. The DITC algorithm minimizes this objective function by alternating computation\nof the cluster distributions and assignment of compound visual words to their closest cluster. For\nmore details on the DITC algorithm we refer to Dhillon et al. [5]. Here we apply the DITC algorithm\nto reduce the high dimensionality of the compound vocabularies. We call the compact vocabulary\nwhich is the output of the DITC algorithm the portmanteau vocabulary and its words accordingly\nportmanteau words. The final image representation p(W^R) is a distribution over the portmanteau\nwords.\n\n[Figure 1 plots: panels Bird-200 and Flower-102; curves Direct Empirical and Independence Assumption]\n\nFigure 2: The effect of \u03b1 on DITC clusters. Each of the large boxes contains 100 image patches\nsampled from one Portmanteau word on the Oxford Flower-102 dataset. Top row: five clusters\nfor \u03b1 = 0.1. Note how these clusters are relatively homogeneous in color, while shape varies\nconsiderably within each. Middle row: five clusters sampled for \u03b1 = 0.5. The clusters show\nconsistency over both color and shape. Bottom row: five clusters sampled for \u03b1 = 0.9. Notice how\nin this case shape is instead homogeneous within each cluster.\n\n2.2 Joint distribution estimation\n\nIn solving the problem of high dimensionality of the compound vocabularies we seemingly further\ncomplicated the estimation problem. As DITC is based on estimates of the class-conditional\ndistributions p(S, C|R) = p(W|R) over product vocabularies, we have increased the number of\nparameters to be estimated to M \u00d7 N \u00d7 L. This can easily reach millions of parameters for standard\nimage datasets. To solve this problem we propose to estimate the class conditional distributions by\nassuming independence of color and shape, given the class:\n\n$$p(s_m, c_n | R) \\propto p(s_m | R)\\, p(c_n | R) \\qquad (2)$$\n\nNote that we do not assume independence of the cues themselves, but rather the less restrictive\nindependence of the cues given the class. Instead of directly estimating the empirical joint distribution\np(S, C|R), we reduce the number of parameters to estimate to (M + N) \u00d7 L, which in the\nvocabulary configurations discussed in this paper represents a reduction in complexity of two orders\nof magnitude. 
As an additional advantage, we will show in section 2.3 that estimating the joint\ndistribution p(S, C|R) in this factored form allows us to introduce cue weighting.\n\nFigure 3: The effect of \u03b2 on DITC clusters. For 20 words p(R|wt) is plotted in dotted grey lines.\nDITC is used to obtain ten portmanteau means p(R|Wj), which are plotted in different colors. On the\nleft is shown the final clustering for \u03b2 = 1.0. Note that none of the portmanteau means are\nespecially discriminative for one particular class. On the right, however, for \u03b2 = 5.0 each portmanteau\nconcentrates on discriminating one class.\n\nTo verify the quality of the estimates of equation (2) we perform the following experiment.\nIn figure 1 we plot the Jensen-Shannon (JS) divergence between the empirical joint distribution\nobtained from the test images and the two estimates: direct estimation of the empirical joint distribution\np(S, C|R) on the training set, and an approximate estimate made by assuming independence as in\nequation (2). Results are provided as a function of the number of training images for two large\ndatasets. A low JS-divergence means a better estimate of the true joint-cue distribution. The plotted\nlines show the curves for a color cue vocabulary of 100 words and a shape cue vocabulary of\n5,000 words, resulting in a product vocabulary of 500,000 words. On both datasets we see that the\nindependence assumption actually leads to a better or equally good estimate of the joint distribution.\nIncreasing the number of training samples, or starting with smaller color and shape vocabularies\nand hence reducing the number of parameters to estimate, will improve direct empirical estimates\nof p(S, C). 
However, figure 1 shows that for typical vocabulary settings on large datasets the\nindependence assumption results in equivalently good or better estimates of the joint distribution.\n\n2.3 Cue weighting\n\nConstructing the compact portmanteau vocabularies based on the independence assumption\nsignificantly reduces the number of parameters to estimate. Furthermore, as we will see in this section, it\nallows us to control the relative contribution of color and shape cues in the final representation.\nWe introduce a weighting parameter \u03b1 \u2208 [0, 1] in the estimate of p(S, C|R):\n\n$$p_{\\alpha}(s_m, c_n | R) \\propto p(s_m | R)^{\\alpha}\\, p(c_n | R)^{1-\\alpha} \\qquad (3)$$\n\nwhere an \u03b1 close to zero results in a larger influence of the color words, and an \u03b1 close to one leads\nto a vocabulary which focuses predominantly on shape.\nTo illustrate the influence of \u03b1 on the vocabulary construction, we show in figure 2 samples from\nportmanteau words obtained on the Oxford Flower-102 dataset (see figure 4). The DITC algorithm is\napplied to reduce the product vocabulary of 500,000 compound words to 100 portmanteau words.\nFor settings of \u03b1 \u2208 {0.1, 0.5, 0.9} we show five of the hundred words. Each word is represented by\none hundred randomly sampled patches from the dataset which have been assigned to the word. The\neffect of changing \u03b1 can be clearly seen. For low \u03b1 the portmanteau words exhibit homogeneity\nof color but lack within-cluster shape consistency. On the other hand, for high \u03b1 the words show\nstrong shape homogeneity such as low and high frequency lines and blobs, while color is more\nuniformly distributed. 
For a setting of \u03b1 = 0.5 the clustering is more consistent in both color and\nshape.\nAdditionally, another parameter \u03b2 is introduced:\n\n$$p_{\\alpha,\\beta}(s_m, c_n | R) \\propto \\left( p(s_m | R)^{\\alpha}\\, p(c_n | R)^{1-\\alpha} \\right)^{\\beta} \\qquad (4)$$\n\nTo illustrate the influence of \u03b2 consider the following experiment on synthetic data. We generate a\nset of 100 words which have random discriminative power p(R|wt) over L = 10 classes. In figure 3\nwe show the p(R|wt) for a subset of 20 words in grey, and $p(R|W_j) \\propto \\sum_{w_t \\in W_j} p(w_t)\\, p(R|w_t)$ for\nthe ten portmanteau words in color. We observe that increasing the \u03b2 parameter directs DITC to\nfind clusters which are each highly discriminative for a single class, rather than being discriminative\nover all classes. We found that higher \u03b2 values often lead to image representations which improve\nclassification results.\n\n[Figure 3 plots: panels beta = 1.0 and beta = 5.0; x-axis: classes R; y-axis: p(R|w)]\n\nFigure 4: Example images from the two datasets used in our experiments. Top: images from four\ncategories of the Flower-102 dataset. Bottom: four example images from the Bird-200 dataset.\n\nThese weighting parameters are learned through cross validation on the training set. In practice we\nfound \u03b1 to change with the data set according to the importance of color and shape. The \u03b2 parameter\nwas found to be constant at a value of 5 for the two datasets evaluated in this paper. 
Both parameters\nwere found to significantly improve results on both datasets.\n\n2.4 Image representation with portmanteau vocabularies\n\nWe summarize our approach to constructing portmanteau vocabularies for image representation.\nWe emphasize the fact that our approach is fundamentally about deriving compact multi-cue image\nrepresentations and, as such, can be used as a drop-in replacement in any bag-of-words pipeline.\nImage representation by portmanteau vocabulary built from color and shape cues follows these steps:\n\n1. Independent color and shape vocabularies are constructed by standard K-means clustering\nover color and shape descriptors extracted from training images.\n\n2. Empirical class-conditional word distributions p(S|R) and p(C|R) are computed from the\ntraining set, and the joint cue distribution p(S, C|R) is estimated assuming conditional\nindependence as in equation (4).\n\n3. The portmanteau vocabulary is computed with the DITC algorithm. The output of\nDITC is a list of indexes that maps each member of the compound vocabulary to one\nof the J portmanteau words.\n\n4. Using the index list output by DITC, the original image features are revisited and the index\ncorresponding to the compound shape-color word at each feature is used to represent each\nimage as a histogram over the portmanteau vocabulary.\n\n3 Experimental results\n\nWe follow the standard bag-of-words approach. We use a combination of interest-point detectors\nalong with a dense multi-scale grid detector. The SIFT descriptor [12] is used to construct a shape\nvocabulary. For color we use the color name descriptor, which is computed by converting sRGB\nvalues to color names according to [19] after which each patch is represented as a histogram over\nthe eleven color names. The shape and color vocabularies are constructed using the standard\nK-means algorithm. In all our experiments we use a shape vocabulary of 5000 words and a color\nvocabulary of 100 words. 
Applying Laplace weighting was not found to influence the results and\nwas therefore not used in the experiments. The classifier is a non-linear, multi-way, one-versus-all SVM\nusing the \u03c7^2 kernel [24]. Each test image is assigned the label of the classifier giving the highest\nresponse and the final classification score is the mean recognition rate per category.\nWe performed several experiments to validate our approach to building multi-cue vocabularies by\ncomparing with other methods which are based on exactly the same initial SIFT and CN descriptors:\n\n\u2022 Shape and Color only: a single vocabulary of 5000 SIFT words and one of 100 CN words.\n\n\u2022 Early fusion: SIFT and CN are concatenated into a single descriptor. The relative weight\nof shape and color is optimized by cross-validation. Note that cross-validation on cue\nweighting parameters for early fusion must be done over the entire BOW pipeline, from\nvocabulary construction to classification. Vocabulary size is 5000.\n\n\u2022 Direct empirical: DITC based on the empirical distribution of p(S, C|R) over a total of\n500,000 compound words estimated on the training set.\n\n\u2022 Independence assumption: DITC based on p(S, C|R) = p(S|R)p(C|R), i.e. assuming\nconditional independence of the cues. We also show separate results with and without using \u03b1 and \u03b2.\n\nIn all cases the color-shape visual vocabularies are compressed to 500 visual words and spatial\npyramids are constructed for the final image representation as in [11]. All of the above approaches were\nevaluated on two standard and challenging datasets: Oxford Flower-102 and Caltech-UCSD\nBird-200. The train-test splits are fixed for both datasets and are provided on their respective websites.2\n\n3.1 Results on the Flower-102 and Bird-200 datasets\n\nThe Oxford Flower-102 dataset contains 8189 images of 102 different flower species. 
It is a\nchallenging dataset due to significant scale and illumination changes (see figure 4). The results are\npresented in table 1(a). We see that shape alone yields results superior to color. Early fusion is\nreasonably good at 70.5%. This is, however, obtained through laborious cross validation to obtain\nthe optimal balance between CN and SIFT cues. Since our cue weighting is done after the initial\nvocabulary and histogram construction, cross-validation is significantly faster than for early fusion.\nThe bottom four rows of table 1(a) give the results of our approach to image representation with\nportmanteau vocabularies in a variety of configurations. The direct empirical estimation of the joint\nshape-color distribution provides slightly better results than estimation based on the independence\nassumption. However, weighting the two visual cues using the \u03b1 parameter described in equation (3)\nin the independent estimation of p(S, C|R) improves the results significantly. In particular, the\ngain of almost 7% obtained by adding \u03b2 is remarkable. The best recognition performance was\nobtained for \u03b1 = 0.8 and \u03b2 = 5.\nThe Caltech-UCSD Bird-200 dataset contains 6033 images from 200 different bird species. This\ndataset contains many bird species that closely resemble each other in terms of color and shape cues,\nmaking the recognition task extremely difficult. Table 1(a) contains test results for our approach on\nBird-200 as well. Interestingly, on this dataset color alone outperforms shape alone and early fusion\nyields only a small improvement over color. Results based on portmanteau vocabularies outperform\nearly fusion, and estimation based on the independence assumption provides better results than direct\nempirical estimation. 
These results are further improved by the introduction of cue weighting with\na final score of 22.4% obtained with \u03b1 = 0.7 and \u03b2 = 5 outperforming all others.\n\n3.2 Comparison with the state-of-the-art\n\nRecently, an extensive performance evaluation of color descriptors was presented by van de Sande\net al. [18]. In this evaluation the OpponentSIFT and C-SIFT were reported to provide superior\nperformance on image categorization problems. We construct a visual vocabulary of 5000 visual\nwords for both OpponentSIFT and C-SIFT and apply the DITC algorithm to compress it to 500\nvisual words. As shown in table 1(b), our approach provides significantly better results compared\nto both OpponentSIFT and C-SIFT, possibly due to the fact that neither supports cue weighting.\n\n2The Flower-102 dataset at http://www.robots.ox.ac.uk/vgg/research/flowers/ and the\nBirds-200 set at http://www.vision.caltech.edu/visipedia/CUB-200.html\n\n(a)\nMethod                     Flower-102   Bird-200\nShape only                 60.7         12.9\nColor only                 48.5         16.8\nEarly Fusion               70.5         17.0\nDirect empirical           64.6         18.9\nIndependent                63.5         19.8\nIndependent + \u03b1            66.4         21.6\nIndependent + \u03b1 + \u03b2        73.3         22.4\n\n(b)\nMethod                     Bird-200     Flower-102\nOpponentSIFT               14.0         69.2\nC-SIFT                     13.9         65.9\nMKL [13]                   \u2212            72.8\nMKL [3]                    19.0         \u2212\nRandom Forest [23]         19.2         \u2212\nSaliency [9]               \u2212            71.0\nOur Approach               22.4         73.3\n\nTable 1: Comparative evaluation of our approach. (a) Classification score on Flower-102 and\nBird-200 datasets for individual features, early fusion and several configurations of our approach. (b)\nComparison of our approach to the state-of-the-art on the Bird-200 and Flower-102 datasets.\n\nIn recent years, combining multiple cues using Multiple Kernel Learning (MKL) techniques has\nreceived a lot of attention. These approaches combine multiple cues and multiple kernels and apply\nper-class cue weighting. 
Table 1(b) includes two recent MKL techniques that report state-of-the-art\nperformance. The technique described in [3] is based on geometric blur, grayscale SIFT, color SIFT\nand full image color histograms, while the approach in [13] also employs HSV, SIFT int, SIFT bd,\nand HOG descriptors in the MKL framework of [21]. Despite the simplicity of our approach, which\nis based on only two cues and a single kernel, it outperforms these complex multi-cue learning\ntechniques. Also note that both MKL approaches are based on learning class-specific weighting for\nmultiple cues. This is especially cumbersome when there exist several hundred object categories in\na dataset (e.g. the Bird-200 dataset contains 200 bird categories). In contrast to these approaches,\nwe learn global, class-independent cue weighting parameters to balance color and shape cues.\nOn the Flower-102 dataset, our final classification score of 73.3% is comparable to the state-of-the-art\nrecognition performance [13, 9, 8]3 obtained on this dataset. It should be noted that Nilsback\nand Zisserman [13] obtain a classification performance of 72.8% using segmented images and a\ncombination of four different visual cues in a multiple kernel learning framework. Our performance,\nhowever, is obtained on unsegmented images using only color and shape cues. On the Bird-200\ndataset, our approach significantly outperforms state-of-the-art methods [23, 3, 22].\n\n4 Conclusions\n\nIn this paper we propose a new method to construct multi-cue, visual portmanteau vocabularies\nthat combine color and shape cues. When constructing a multi-cue vocabulary two properties are\nespecially desirable: cue binding and cue weighting. Starting from multi-cue product vocabularies\nwe compress this representation to form discriminative compound terms, or portmanteaux, used in\nthe final image representation. 
Experiments demonstrate that assuming independence of visual cues\ngiven the categories provides a robust estimation of joint-cue distributions compared to direct\nempirical estimation. Assuming independence also has the advantage of both reducing the complexity\nof the representation by two orders of magnitude and allowing flexible cue weighting. Our final\nimage representation is compact, maintains the cue binding property, admits cue weighting and yields\nstate-of-the-art performance on the image categorization problem.\nWe tested our approach on two datasets, each with more than one hundred object categories. Results\ndemonstrate the superiority of our approach over existing ones combining color and shape cues. We\nobtain a gain of 2.8% and 5.4% over the early fusion approach on Flower-102 and Bird-200,\nrespectively. Our approach also outperforms\nmethods based on multiple cues and MKL with per-class parameter learning. This leaves open the\npossibility of using our approach to multi-cue image representation within an MKL framework.\nAcknowledgments: This work is supported by the EU project ERG-TS-VICI-224737; by the Spanish\nResearch Program Consolider-Ingenio 2010: MIPRCV (CSD200700018); by the Tuscan\nRegional project MNEMOSYNE (POR-FSE 2007-2013, A.IV-OB.2); and by the Spanish projects\nTIN2009-14173, TIN2010-21771-C02-1. Joost van de Weijer acknowledges the support of a\nRamon y Cajal fellowship.\n\n3From correspondence with the authors of [8] we learned that the results reported in their paper are\nerroneous and they do not obtain results better than [13].\n\nReferences\n[1] Francis Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008.\n[2] A. Bosch, A. Zisserman, and X. Munoz. Scene classification via pLSA. In ECCV, 2006.\n[3] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge\nBelongie. Visual recognition with humans in the loop. In ECCV, 2010.\n[4] G. Csurka, C. 
Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on\nStatistical Learning in Computer Vision, ECCV, 2004.\n[5] Inderjit Dhillon, Subramanyam Mallela, and Rahul Kumar. A divisive information-theoretic feature\nclustering algorithm for text classification. Journal of Machine Learning Research (JMLR), 3:1265\u20131287,\n2003.\n[6] Noha M. Elfiky, Fahad Shahbaz Khan, Joost van de Weijer, and Jordi Gonzalez. Discriminative compact\npyramids for object and scene recognition. Pattern Recognition, 2011.\n[7] Brian Fulkerson, Andrea Vedaldi, and Stefano Soatto. Localizing objects with smart dictionaries. In\nECCV, 2008.\n[8] Satoshi Ito and Susumu Kubota. Object classification using heterogeneous co-occurrence features. In\nECCV, 2010.\n[9] Christopher Kanan and Garrison Cottrell. Robust classification of objects, faces, and flowers using natural\nimage statistics. In CVPR, 2010.\n[10] Fahad Shahbaz Khan, Joost van de Weijer, and Maria Vanrell. Top-down color attention for object\nrecognition. In ICCV, 2009.\n[11] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching\nfor recognizing natural scene categories. In CVPR, 2006.\n[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91\u2013110, 2004.\n[13] M-E Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In\nICVGIP, 2008.\n[14] Alain Rakotomamonjy, Francis Bach, Stephane Canu, and Yves Grandvalet. More efficiency in multiple\nkernel learning. In ICML, 2007.\n[15] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image\ncollections. In ICCV, 2005.\n[16] Noam Slonim and Naftali Tishby. Agglomerative information bottleneck. In NIPS, 1999.\n[17] Anne Treisman. Feature Binding, Attention and Object Perception. 
Philosophical Transactions: Biological\nSciences, 353(1373):1295\u20131306, 1998.\n[18] Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors for object\nand scene recognition. PAMI, 32(9):1582\u20131596, 2010.\n[19] J. van de Weijer, C. Schmid, Jakob J. Verbeek, and D. Larlus. Learning color names for real-world\napplications. IEEE Transactions on Image Processing (TIP), 18(7):1512\u20131524, 2009.\n[20] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In ICML,\n2009.\n[21] Manik Varma and Debajyoti Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.\n[22] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong.\nLocality-constrained linear coding for image classification. In CVPR, 2010.\n[23] Bangpeng Yao, Aditya Khosla, and Li Fei-Fei. Combining randomization and discrimination for\nfine-grained image categorization. In CVPR, 2011.\n[24] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of\ntexture and object categories: A comprehensive study. IJCV, 73(2):213\u2013218, 2007.", "award": [], "sourceid": 773, "authors": [{"given_name": "Fahad", "family_name": "Khan", "institution": null}, {"given_name": "Joost", "family_name": "Weijer", "institution": null}, {"given_name": "Andrew", "family_name": "Bagdanov", "institution": null}, {"given_name": "Maria", "family_name": "Vanrell", "institution": null}]}