{"title": "Unsupervised Template Learning for Fine-Grained Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 3122, "page_last": 3130, "abstract": "Fine-grained recognition refers to a subordinate level of recognition, such are recognizing different species of birds, animals or plants. It differs from recognition of basic categories, such as humans, tables, and computers, in that there are global similarities in shape or structure shared within a category, and the differences are in the details of the object parts. We suggest that the key to identifying the fine-grained differences lies in finding the right alignment of image regions that contain the same object parts. We propose a template model for the purpose, which captures common shape patterns of object parts, as well as the co-occurence relation of the shape patterns. Once the image regions are aligned, extracted features are used for classification. Learning of the template model is efficient, and the recognition results we achieve significantly outperform the state-of-the-art algorithms.", "full_text": "Unsupervised Template Learning for\n\nFine-Grained Object Recognition\n\nShulin Yang\n\nLiefeng Bo\n\nUniversity of Washington, Seattle, WA 98195\n\nISTC-PC Intel labs, Seattle, WA 98195\n\nyang@cs.washington.edu\n\nliefeng.bo@intel.com\n\nJue Wang\n\nLinda Shapiro\n\nAdobe ATL Labs, Seattle, WA 98103\n\nUniversity of Washington, Seattle, WA 98195\n\njuewang@adobe.com\n\nshapiro@cs.washington.edu\n\nAbstract\n\nFine-grained recognition refers to a subordinate level of recognition, such as rec-\nognizing different species of animals and plants. It differs from recognition of\nbasic categories, such as humans, tables, and computers, in that there are global\nsimilarities in shape and structure shared cross different categories, and the dif-\nferences are in the details of object parts. We suggest that the key to identifying\nthe \ufb01ne-grained differences lies in \ufb01nding the right alignment of image regions\nthat contain the same object parts. We propose a template model for the pur-\npose, which captures common shape patterns of object parts, as well as the co-\noccurrence relation of the shape patterns. Once the image regions are aligned,\nextracted features are used for classi\ufb01cation. Learning of the template model is\nef\ufb01cient, and the recognition results we achieve signi\ufb01cantly outperform the state-\nof-the-art algorithms.\n\n1\n\nIntroduction\n\nObject recognition is a major focus of research in computer vision and machine learning. In the last\ndecade, most of the existing work has been focused on basic recognition tasks: distinguishing differ-\nent categories of objects, such as table, computer and human. Recently, there is an increasing trend\nto work on subordinate-level or \ufb01ne-grained recognition that categorizes similar objects, such as\ndifferent types of birds or dogs, into their subcategories. The subordinate-level recognition problem\ndiffers from the basic-level tasks in that the object differences are more subtle. 
Fine-grained recognition is generally more difficult than basic-level recognition for both humans and computers, but success would be widely useful in applications such as fisheries (fish recognition), agriculture (farm animal recognition), health care (food recognition), and others.

Cognitive research suggests that basic-level recognition is based on comparing the shapes of objects and their parts, whereas subordinate-level recognition is based on comparing appearance details of certain object parts [1]. This suggests that finding the right correspondence of object parts is of great help in recognizing fine-grained differences. For basic-level recognition tasks, spatial pyramid matching [2] is a popular choice that aligns object parts by partitioning the whole image into multiple-level spatial cells. However, spatial pyramid matching may not be the best choice for fine-grained object recognition, since falsely aligned object parts can lead to inaccurate comparisons, as shown in Figure 1.

This work is intended to alleviate the limitations of spatial pyramid matching. Our key observation is that in a fine-grained task, different object categories share commonality in their shape or structure, and the alignment of object parts can be greatly improved by discovering such common shape patterns. For example, bird images from different species may have similar shape patterns in their beaks, tails, feet or bodies. The commonality is usually a part of the global shape, and can be observed in bird images across different species and in different poses. This motivates us to decompose a fine-grained object recognition problem into two sub-problems: 1) aligning image regions that contain the same object part, and 2) extracting image features within the aligned image regions.

Figure 1: Region alignment by spatial pyramid matching and our approach. Spatial pyramid matching partitions the whole image into regions, without considering visual appearance. A 4×4 partition leads to misalignment of parts of the birds, while a coarse partition (e.g. 2×2) includes irrelevant features. Our approach aims to align the image regions containing the same object parts (red squares).

To this end, we propose a template model to align object parts. In our model, a template represents a shape pattern, and the relationship between two shape patterns is captured by the relationship between templates, which reflects the probability of their co-occurrence in the same image. This model is learned with an alternating algorithm, which iterates between detecting aligned image regions and updating the template model. Kernel descriptor features [3, 4] are then extracted from the image regions aligned by the learned templates.

Our model is evaluated on two benchmark datasets: Caltech-UCSD Bird-200 and Stanford Dogs.
Our experimental results suggest that the proposed template model is capable of detecting image regions that correspond to meaningful object parts, and that our template-based algorithm outperforms state-of-the-art fine-grained object recognition algorithms in terms of accuracy.

2 Related Work

An increasing number of papers have focused on fine-grained object recognition in recent years [5, 6, 1, 7, 8, 9]. In [5], multiple kernel learning is used to combine different types of features and serves as a baseline fine-grained recognition algorithm, and human help is used to discover useful attributes. In [9], a random forest is proposed for fine-grained object recognition that uses different depths of the tree to capture dense spatial information. In [6], a multi-cue combination is used to build discriminative compound words from primitive cues learned independently from training images. In [10], bagging is used to select discriminative templates from randomly generated ones. In [11], image regions are treated as discriminative attributes, and a CRF is used to learn the attributes on the training set with a human in the loop. Pose pooling [12] adapted Poselets [13] to fine-grained recognition problems and learned different poses from fully annotated data. Though the deformable parts model [14] is powerful for object detection, it might be insufficient to capture the flexibility and variability in the fine-grained tasks considered here [15].

3 Unsupervised Learning of Template Model

This section provides an overview of our fine-grained object recognition approach. We discuss the framework of our template-based object recognition, describe our template model, and propose an alternating algorithm for learning the model parameters.

Figure 2: The framework for fine-grained recognition: the recognition pipeline goes from left to right. In the training stage, a template model is learned from training images using Algorithm 1. In the recognition stage, the learned templates are applied to each test image, resulting in aligned image regions. Then image-level feature vectors are extracted as the concatenation of the features of all aligned regions. Finally, a linear SVM is used for recognition.

3.1 Template-Based Fine-Grained Object Recognition

Over the last decades, computer vision researchers have put considerable effort into designing effective and efficient patch-level features for object recognition [16, 17, 18, 19, 20, 21, 22]. SIFT is one of the most successful features, allowing an image or object to be represented as a bag of SIFT features [16]. However, the ability of such methods is somewhat limited. Patch-level features are descriptive only within spatial context constraints. For example, a cross shape can be a symbol for the Red Cross, the Christian religion, or Swiss Army products, depending on the larger spatial context in which it is detected. It is hard to interpret the meaning of patch-level features without considering such spatial contexts. This is even more important for a fine-grained recognition task, since common features can be shared by instances from both the same and different object classes. Spatial pyramid models [2, 20, 23] align sub-images/parts that are spatially close by partitioning the whole image into multi-level spatial cells.
However, the alignments produced by the spatial pyramid are not necessarily correct, since no displacements are allowed in the model (Figure 1). Here, we use a template model to find correctly-aligned regions in different images, so that comparisons between them are more meaningful. A template represents one type of common shape pattern of an object part, while an object part can be represented by several different templates. Certain shape patterns of two object parts (for instance, a head facing left and a tail pointing right) can frequently be observed in the same image. Our template model is designed to capture both the properties of individual templates and the relationships among templates. The model parameters are learned from a collection of unlabeled images in an unsupervised manner. See Sections 3.2 and 3.3 for details.

Once the templates and their relationships are learned, the fine-grained differences can be aligned based on these quantities. The framework of our template-based fine-grained object recognition is illustrated in Figure 2. In the learning stage, Algorithm 1 is used to find the templates. In the recognition stage (from left to right in Figure 2), aligned image regions are extracted from each image using our template detection algorithm. Color-based, normalized color-based, gradient-based, and LBP-based kernel descriptors followed by EMK [4] are then applied to generate feature representations for each region. The image-level feature is the concatenation of the feature representations of all detected regions in the corresponding image. Finally, a linear SVM [24] is used for recognition.

3.2 Template Model

We start by defining a template model that represents the common shape patterns of object parts and their relationships. A template is an entity containing features that are matched against image features for region detection. Let M = {T, W} be a model that contains a group of templates T = {T_1, T_2, ..., T_K} and their co-occurrence relationships W = {w_11, w_12, ..., w_KK}, where K is the number of templates and each w_ij is between 0 and 1. When w_ij = 0, the two templates T_i and T_j have no co-occurrence relationship.

When a template model is matched to a given image, not all templates within the model are necessarily used. This is because different templates can be associated with the same object part, but one part occurs at most once in an image. Our model captures this intuition by deactivating templates that do not match the image well.
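For concreteness, the following is a minimal sketch of how such a model might be organized in code. It is our own illustration, not part of the paper: the class name, the fields, and the choice of fixed-length feature vectors for templates are all assumptions.

```python
import numpy as np

class TemplateModel:
    """Illustrative container for M = {T, W}; all names are hypothetical."""
    def __init__(self, K, feat_dim, init_locations, sizes):
        self.T = np.zeros((K, feat_dim))   # template features T_1..T_K
        self.W = np.zeros((K, K))          # co-occurrence weights w_ij in [0, 1]
        self.init_loc = np.asarray(init_locations, dtype=float)  # initial relative locations
        self.sizes = list(sizes)           # template-to-image size ratios, e.g. 1/2, 1/3
```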
To model the appearance properties of templates and their relationships, the score function between the templates and a given image I should capture three aspects: 1) fitness, which computes the similarity between the selected templates and the image regions that match them best; 2) co-occurrence, which encourages selecting templates that have a high chance of co-occurring in the same image; and 3) diversity, which gives preference to having the selected templates match well-separated image regions.

Fitness: We define a matching score s_f(T_i, x_i^I) to measure the similarity between a template T_i and an image region at location x_i^I in image I:

    s_f(T_i, x_i^I) = 1 - \| T_i - R(x_i^I) \|^2,   s.t. |x_i^I - \bar{x}_i^I| \le \alpha    (1)

where R(x_i^I) represents the features of the sub-image of I centered at location x_i^I, \bar{x}_i^I is an initial location associated with the template T_i, and \alpha is an upper bound on the location variation. Both x_i^I and \bar{x}_i^I are measured by their relative location in image I. If |x_i^I - \bar{x}_i^I| > \alpha, the location is too far from the initial location, and the score is set to zero.

The features describing R(x_i^I) should be able to capture common properties of object parts. Since the same type of part from different objects usually shares similar shapes, we introduce edge kernel descriptors to capture this common statistic. We first run the Berkeley edge detector [25] to compute the edge map of an image, then treat the edge map as a grayscale image and extract color kernel descriptors [3] over it. Using these descriptors, we compute s_f(T_i, x_i^I); the higher its value, the better the match.

Summing the matching scores s_f(T_i, x_i^I) over all templates that are used for image I, we obtain a fitness term

    S_f(T, X^I, V^I) = \sum_{i=1}^{K} v_i^I s_f(T_i, x_i^I)    (2)

where V^I = {v_1^I, ..., v_K^I} represents the selected template subset for image I (v_i^I = 1 means that template T_i is used for image I), and X^I = {x_1^I, ..., x_K^I} represents the locations of all templates in image I. The more templates that are used, the higher the score.

Co-occurrence: Based on the observation that certain shape patterns of two or more object parts frequently coexist in the same image, templates that have a high chance of co-occurring should be selected together. For a given image, the co-occurrence term encourages selecting pairs of templates with a large relation parameter w_ij. Meanwhile, an L1 penalty term is used to ensure sparsity of the template relations:

    S_c(W, V^I) = \sum_{i=1}^{K} \sum_{j=1}^{K} v_i^I v_j^I w_{ij} - \lambda \sum_{i=1}^{K} \sum_{j=1}^{K} |w_{ij}|,   s.t. 0 \le w_{ij} \le 1    (3)

Diversity: This term enforces spatial constraints on the locations of the selected templates. In particular, their locations should not be too close to each other, because we want the learned templates to be diverse, so that they cover a large range of image shape patterns. The term sums a location penalty over pairs of selected templates,

    S_d(X^I, V^I) = - \sum_{i=1}^{K} \sum_{j=1}^{K} v_i^I v_j^I d(x_i^I, x_j^I)    (4)

where d(x_i^I, x_j^I) is the location penalty function: d(x_i^I, x_j^I) = \infty if |x_i^I - x_j^I| < \beta, and d(x_i^I, x_j^I) = 0 otherwise. \beta is a distance parameter.
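To make these three terms concrete, here is a small computational sketch of the per-image score in plain NumPy. It is a sketch under our own assumptions, not the authors' code: the location distance is taken to be Euclidean over relative coordinates, self-pairs (i = j) are assumed excluded from the pairwise terms, the infinite diversity penalty is realized by returning -inf, and α, β are set to the values reported later in Section 4.1.

```python
import numpy as np

ALPHA, BETA = 24.0, 32.0   # α and β in pixels, the values used in Section 4.1

def fitness(T_i, R, x, x_init):
    # Eq. (1): s_f = 1 - ||T_i - R(x)||^2, set to zero when x strays
    # more than α from the template's initial location.
    if np.linalg.norm(x - x_init) > ALPHA:
        return 0.0
    return 1.0 - float(np.sum((T_i - R(x)) ** 2))

def image_score(model, v, x, R):
    # S_f + S_c (without its W-only L1 part) + S_d for one image, given
    # boolean template indicators v and an array of locations x.
    K = len(v)
    score = sum(fitness(model.T[i], R, x[i], model.init_loc[i])
                for i in range(K) if v[i])
    for i in range(K):
        for j in range(K):
            if i != j and v[i] and v[j]:
                if np.linalg.norm(x[i] - x[j]) < BETA:
                    return -np.inf        # d = ∞: selections this close are forbidden
                score += model.W[i, j]    # co-occurrence reward w_ij
    return score
```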
Summing the three terms defined above (fitness, co-occurrence and diversity) over all images in the image set D, we obtain the overall score function between the templates and the images:

    S(T, W, X, V, D) = \sum_{I \in D} ( S_f(T, X^I, V^I) + S_c(W, V^I) + S_d(X^I, V^I) )    (5)

where V = {V^1, V^2, ..., V^{|D|}} are the template indicators, X = {X^1, X^2, ..., X^{|D|}} are the template locations, and |D| is the number of images in the set D. The templates and their relations are learned by maximizing the score function S(T, W, X, V, D) on an image collection D.

Algorithm 1 Template Model Learning
input: image set D, maximum iteration maxiter, threshold \epsilon
output: template model M = {T, W}
    Initialize {T_1, T_2, ..., T_K} with training data; initialize w_ij = 0; iter = 0
    while iter < maxiter do
        update X^I, V^I for all I \in D based on equation (6)
        update W to optimize (9)
        update T by T_i = \sum_{I \in D} v_i^I R(x_i^I) / \sum_{I \in D} v_i^I (as in (8))
        if \sum_i |\Delta T_i| < \epsilon then break
        iter <- iter + 1
    end while

3.3 Template Learning

We use an alternating algorithm to optimize (5). The proposed algorithm iterates among three steps:

- updating X, V (template detection),
- updating T (template feature learning), and
- updating W (template relation learning).

Template detection: Given a template model {T, W}, the goal of template detection is to find the template subset V and the locations X for all images that maximize equation (5). The second term of S_c in equation (3) is a constant given W, so maximizing (5) reduces to maximizing the following objective for each image I separately:

    \max_{X^I, V^I} \sum_{i=1}^{K} v_i^I s_f(T_i, x_i^I) + \sum_{i=1}^{K} \sum_{j=1}^{K} v_i^I v_j^I ( w_{ij} - d(x_i^I, x_j^I) )    (6)

This optimization problem is NP-hard, so a greedy approach is used: the algorithm starts with an empty set, first calculates the scores for all templates, and then selects the template with the largest score. Fixing the locations of all previously selected templates, the next template and its location are chosen in the same manner. The procedure is repeated until the objective function (6) no longer increases.

Template feature learning: The goal of template feature learning is to optimize the templates T given the relation parameters W and the current template detection results V, X. When maximizing (5), S_d and S_c are constant given V, X and W. The optimal template T_i can therefore be found by maximizing

    \max_{T_i} \sum_{I \in D} v_i^I ( 1 - \| T_i - R(x_i^I) \|^2 )    (7)

which is solved in closed form (setting the gradient with respect to T_i to zero) by

    T_i = \sum_{I \in D} v_i^I R(x_i^I) / \sum_{I \in D} v_i^I    (8)

Eq. (8) means that the template T_i is updated to the average of the features of all sub-images in D detected by the i-th template.

Template relation learning: The goal here is to assign values to the relation parameters W, given all the other parameters (T, V and X), so as to maximize equation (5). Since only W are optimization parameters, S_f and S_d are both constant, and optimizing (5) simplifies to maximizing

    \max_{W} \sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij} \sum_{I \in D} v_i^I v_j^I - \lambda |D| \sum_{i=1}^{K} \sum_{j=1}^{K} |w_{ij}|    (9)

An L1 regularization solver [26] is used to optimize this objective.
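As a concrete illustration of the greedy detection step described above, the sketch below selects templates one at a time for a single image, reusing the fitness helper, the BETA constant, and the model container from the earlier sketches. The candidate location grid, the Euclidean location distance, and all identifiers are our own assumptions; this is one plausible reading of the procedure, not the authors' implementation.

```python
import numpy as np

def detect_templates(model, R, candidate_locs):
    # Greedy maximization of Eq. (6) for one image: repeatedly add the
    # (template, location) pair with the largest marginal gain, stopping
    # when no addition increases the objective.
    K = model.T.shape[0]
    v = np.zeros(K, dtype=bool)   # template indicators v_i
    x = {}                        # chosen locations x_i
    while True:
        best_gain, best = 0.0, None
        for i in range(K):
            if v[i]:
                continue
            for loc in candidate_locs[i]:   # locations within α of init_loc[i]
                gain = fitness(model.T[i], R, loc, model.init_loc[i])
                for j in x:                 # pairwise terms against selected templates
                    if np.linalg.norm(loc - x[j]) < BETA:
                        gain = -np.inf      # diversity penalty d = ∞
                        break
                    gain += model.W[i, j] + model.W[j, i]
                if gain > best_gain:
                    best_gain, best = gain, (i, loc)
        if best is None:                    # objective no longer increases
            return v, x
        v[best[0]] = True
        x[best[0]] = best[1]
```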
Figure 3: Object parts (black squares) detected by the learned templates. Each row (T1-T5) shows the parts found by one learned template; the sub-image within the black square has the highest matching score for the given image. Meaningful parts such as heads, backs and tails are successfully detected.

The whole learning procedure is summarized in Algorithm 1. The algorithm starts by initializing K templates of various sizes, at initial locations that are evenly spaced in an image. Each iteration alternates among template detection, template feature learning, and template relation learning. The iterations continue until the total change of the templates {T_i}_{i=1}^{K} is smaller than a threshold \epsilon.

4 Experiments

We tested our model on two publicly available datasets: Caltech-UCSD Bird-200 and Stanford Dogs. These two datasets are standard benchmarks for evaluating fine-grained object recognition algorithms. Our experiments suggest that the proposed template model is able to detect meaningful parts, and that it outperforms previous work in terms of accuracy.

4.1 Features and Settings

We use kernel descriptors (KDES) to capture low-level image statistics: color, shape and texture [3]. In particular, we use four types of kernel descriptors: color-based, normalized color-based, gradient-based, and local-binary-pattern-based descriptors (1). Color and normalized color kernel descriptors are extracted over RGB images, while gradient and local binary pattern kernel descriptors are extracted over grayscale images converted from the original RGB images. Following the standard parameter setting, we compute kernel descriptors on 16×16 image patches over dense regular grids with a spacing of 8 pixels. For template relation learning, we use a publicly available L1 regularization solver (2). All images are resized to be no larger than 300×300 with the height/width ratio preserved.

To learn the template model, we use 34 templates of different sizes. The template size is measured by its ratio to the original image size, such as 1/2 or 1/3. Our model has 9 templates of size 1/2 and 25 of size 1/3. The initial locations of the templates of each size are evenly spaced grid points in an image. We observe that the learning algorithm converges quickly and usually becomes stable after around 15-20 iterations. The sparsity parameter \lambda is set to 0.1; the other model parameters are \alpha = 24 and \beta = 32 pixels. These parameters were optimized by cross validation on the training set of the Bird dataset, and the same parameter setting is then applied to the Dog dataset. On each region detected by the templates, we compute template-level features using EMK [4]. After obtaining these template-level features, we train a linear support vector machine for fine-grained object recognition.

(1) http://www.cs.washington.edu/ai/Mobile_Robotics/projects/kdes/
(2) http://www.di.ens.fr/~mschmidt/Software/L1General.html
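The recognition-stage data flow (per-region features concatenated into an image-level vector, then a linear SVM) can be sketched as follows. This is only an outline under our own assumptions: scikit-learn's LinearSVC stands in for the LIBSVM-based linear SVM, and align_regions / kdes_emk are hypothetical helpers for template detection and region description, not functions from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def image_feature(regions, extract_region_feature):
    # Image-level representation: one feature vector per aligned region
    # (e.g. KDES followed by EMK), concatenated in a fixed template order.
    return np.concatenate([extract_region_feature(r) for r in regions])

# Hypothetical end-to-end usage (align_regions and kdes_emk are assumed):
#   X_train = np.stack([image_feature(align_regions(im), kdes_emk)
#                       for im in train_images])
#   clf = LinearSVC().fit(X_train, y_train)
#   y_pred = clf.predict(np.stack([image_feature(align_regions(im), kdes_emk)
#                                  for im in test_images]))
```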
Notice that there is a slight difference between template detection in the learning phase and in the recognition phase. In the learning phase, only a subset of the templates is detected in each image, because not all templates can be observed in every image. In the recognition phase, however, all templates are used for detection, in order to avoid missing features.

Table 1: Classification accuracies (%) for templates of different sizes and numbers. Left: results on a 20-category subset for each template size (rows) and number of templates (columns); accuracy improves as the number of templates grows and saturates once enough templates are used. Right: accuracies of combinations of template sizes (each with its best template number, selected on the training set) on the full dataset; the combination of 9 templates of size 1/2 and 25 templates of size 1/3 performs best.

    Size \ Number |   1     4     9    16    25    36
    1             | 46.1  46.1  46.1  46.1  46.1  46.1
    1/2           | 39.6  46.8  50.7  50.7  48.9  47.5
    1/3           | 33.2  42.9  41.8  43.9  44.3  44.3
    1/4           | 32.1  37.5  40.4  40    40.4  40

    Combination                                  | Acc
    9 T_{1/2}                                    | 27.1
    T_1 + 9 T_{1/2}                              | 27.4
    9 T_{1/2} + 25 T_{1/3}                       | 28.2
    T_1 + 9 T_{1/2} + 25 T_{1/3} + 25 T_{1/4}    | 28.2

Table 2: Effect of the sparsity parameter \lambda; the best accuracy is achieved at \lambda = 0.1.

    \lambda       |   0    0.001  0.005  0.01   0.05   0.1   0.5    1
    Accuracy (%)  | 48.57  48.93  49.28  49.29  49.64  50.7  50    48.57

4.2 Bird Recognition

Caltech-UCSD Bird-200 [8] is a commonly used dataset for evaluating fine-grained object recognition algorithms. The dataset contains 6033 images of 200 bird species of North America; in each image, the bounding box of the bird is given. Following the standard setting [5], 15 images of each species are used for training and the rest for testing.

Template learning: Figure 3 visualizes the rectangles/parts detected by the learned templates; the feature in each template is a vector of real numbers. As can be seen, the learned templates successfully find meaningful parts of the birds, even though the appearances of these parts are very different. For example, the head parts detected by T1 have quite different colors and textures, suggesting the robustness of the proposed template model.

Sparsity parameter \lambda: We tested different values of the sparsity parameter \lambda on a subset of 20 categories (from the training set) for efficiency. If \lambda = 0, there is no penalty on the relation parameters W, so all weights w_ij are set to 1 when the template model is learned. If \lambda \ge 1, the penalty on the relation parameters is large enough that all w_ij are set to 0 after learning. In both cases, the template model is equivalent to a simplified model without the co-occurrence term in (3). For \lambda between 0 and 1, the results in Table 2 show that the best accuracy is achieved at \lambda = 0.1.

Template size and number choices: We tested the effect of the number and size of the templates on the recognition accuracy, again on a subset of 20 categories for efficiency. When the template size is 1, the accuracy is the same for any number of templates, because template detection returns the same result. For templates smaller than size 1, the results obtained with different numbers of templates are shown in Table 1 (left).
Based on these results, we selected a template number for each template size for further experiments: one template of size 1, 9 templates of size 1/2, 25 templates of size 1/3, and 25 templates of size 1/4. The results obtained by combining templates of different sizes (each with its optimal template number) on the full dataset are shown in Table 1 (right). The highest accuracy is achieved by the combination of 9 templates of size 1/2 and 25 templates of size 1/3. Our further experiments suggest that adding more templates only slightly improves the recognition accuracy.

Running time: Our algorithm is efficient. With a non-optimized implementation, each training iteration takes 2-3 minutes. In the test stage, it takes 3-5 seconds to process each image, including template detection, feature extraction and classification. This is fast enough for an online recognition task.

Comparisons with the state-of-the-art algorithms: In Table 3, we compare our model with four recently published algorithms for fine-grained object recognition: multiple kernel learning [5], random forest [9], LLC [9], and multi-cue [6]. We also compare our model to KDES [3] with a spatial pyramid, a strong baseline in terms of accuracy. We observe that KDES with a spatial pyramid works well on this dataset, and the proposed template model works even better. The template model achieves 28.2% accuracy, about 6 percentage points higher than the best results reported in previous work and about 2 percentage points higher than KDES with a spatial pyramid. This accuracy is comparable with the recently proposed pose pooling approach [12], which uses labeled parts to train and test models; such labels are not required by our template model.

Table 3: Comparisons on Caltech-UCSD Bird-200. Our template model is compared to recently proposed fine-grained recognition algorithms; performance is measured in terms of accuracy (%).

    MKL [5] | LLC [9] | Rand-forest [9] | Multi-cue [6] | KDES [3] | This work
    19.0    | 18.0    | 19.2            | 22.4          | 26.4     | 28.2

4.3 Dog Recognition

The Stanford Dogs dataset is another benchmark for fine-grained image categorization, recently introduced in [27]. The dataset contains 20,580 images of 120 breeds of dogs from around the world; bounding boxes of the dogs are provided for all images. This dataset is a good complement to Caltech-UCSD Bird-200 because it has more images per category: around 200 per class, versus about 30 per class in Bird-200. Following the standard setting [27], 100 images of each category are used for training and the rest for testing.

Table 4: Comparisons on the Stanford Dogs dataset. Our approach is compared to a baseline algorithm [27] and to KDES with a spatial pyramid. We give results for the proposed template model with two types of templates: edge templates and texture templates.

    Method       | SIFT [27] | KDES [3] | Edge templates | Texture templates
    Accuracy (%) | 22.0      | 36.0     | 38.0           | 36.9

Comparisons with the state-of-the-art algorithms: In Table 4, we compare our model with a baseline algorithm [27] and with KDES with a spatial pyramid on this dataset. For the dog dataset, we also tried using the local binary pattern KDES, instead of the edge KDES, to learn templates, due to the relatively consistent textures of dog images.
Our experiments show that template learning with the edge KDES works better than with the local binary pattern KDES, suggesting that edge information is a stable cue for learning templates. Notice that the accuracy achieved by our template model is 16 percentage points higher than the best previously published result.

5 Conclusion

We have proposed a template model for fine-grained object recognition. The template model learns a group of templates by jointly considering fitness, co-occurrence and diversity between the templates and the images, and the learned templates are used to align image regions that contain the same object parts. Our experiments show that the proposed template model achieves higher accuracy than state-of-the-art fine-grained object recognition algorithms on two standard benchmarks: Caltech-UCSD Bird-200 and Stanford Dogs. In the future, we plan to learn features that are suitable for detecting object parts and to incorporate geometric information into the template relationships.

References

[1] Farrell, R., Oza, O., Zhang, N., Morariu, V., Darrell, T., Davis, L.: Birdlets: subordinate categorization using volumetric primitives and pose-normalized appearance. ICCV (2011)
[2] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. CVPR (2006)
[3] Bo, L., Ren, X., Fox, D.: Kernel descriptors for visual recognition. NIPS (2010)
[4] Bo, L., Sminchisescu, C.: Efficient match kernel between sets of features for visual recognition. NIPS (2009)
[5] Branson, S., Wah, C., Babenko, B., Schroff, F., Welinder, P., Perona, P., Belongie, S.: Visual recognition with humans in the loop. ECCV (2010)
[6] Khan, F., van de Weijer, J., Bagdanov, A., Vanrell, M.: Portmanteau vocabularies for multi-cue image representations. NIPS (2011)
[7] Wah, C., Branson, S., Perona, P., Belongie, S.: Interactive localization and recognition of fine-grained visual categories. ICCV (2011)
[8] Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Technical Report CNS-TR-201, Caltech (2010)
[9] Yao, B., Khosla, A., Fei-Fei, L.: Combining randomization and discrimination for fine-grained image categorization. CVPR (2011)
[10] Yao, B., Bradski, G., Fei-Fei, L.: A codebook-free and annotation-free approach for fine-grained image categorization. CVPR (2012)
[11] Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-grained recognition. CVPR (2012)
[12] Zhang, N., Farrell, R., Darrell, T.: Pose pooling kernels for sub-category recognition. CVPR (2012)
[13] Bourdev, L., Malik, J.: Poselets: body part detectors trained using 3D human pose annotations. ICCV (2009)
[14] Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010)
[15] Parkhi, O., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. CVPR (2012)
[16] Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60 (2004)
[17] Lee, H., Battle, A., Raina, R., Ng, A.: Efficient sparse coding algorithms. NIPS (2007)
[18] Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. CVPR (2009)
[19] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Guo, Y.: Locality-constrained linear coding for image classification. CVPR (2010)
[20] Boureau, Y., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. CVPR (2010)
[21] Coates, A., Ng, A.: The importance of encoding versus training with sparse coding and vector quantization. ICML (2011)
[22] Yu, K., Lin, Y., Lafferty, J.: Learning image representations from the pixel level via hierarchical sparse coding. CVPR (2011)
[23] Boureau, Y., Ponce, J.: A theoretical analysis of feature pooling in visual recognition. ICML (2010)
[24] Chang, C., Lin, C.: LIBSVM: a library for support vector machines. (2001)
[25] Maire, M., Arbelaez, P., Fowlkes, C., Malik, J.: Using contours to detect and localize junctions in natural images. CVPR (2008)
[26] Schmidt, M., Fung, G., Rosales, R.: Optimization methods for L1-regularization. UBC Technical Report (2009)
[27] Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L.: Novel dataset for fine-grained image categorization. First Workshop on Fine-Grained Visual Categorization, CVPR (2011)
", "award": [], "sourceid": 1437, "authors": [{"given_name": "Shulin", "family_name": "Yang", "institution": null}, {"given_name": "Liefeng", "family_name": "Bo", "institution": null}, {"given_name": "Jue", "family_name": "Wang", "institution": null}, {"given_name": "Linda", "family_name": "Shapiro", "institution": null}]}