{"title": "Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach", "book": "Advances in Neural Information Processing Systems", "page_first": 181, "page_last": 189, "abstract": "Most current image categorization methods require large collections of manually annotated training examples to learn accurate visual recognition models. The time-consuming human labeling effort effectively limits these approaches to recognition problems involving a small number of different object classes. In order to address this shortcoming, in recent years several authors have proposed to learn object classifiers from weakly-labeled Internet images, such as photos retrieved by keyword-based image search engines. While this strategy eliminates the need for human supervision, the recognition accuracies of these methods are considerably lower than those obtained with fully-supervised approaches, because of the noisy nature of the labels associated to Web data.  In this paper we investigate and compare methods that learn image classifiers by combining very few manually annotated examples (e.g., 1-10 images per class) and a large number of weakly-labeled Web photos retrieved using keyword-based image search. We cast this as a domain adaptation problem: given a few strongly-labeled examples in a target domain (the manually annotated examples) and many source domain examples (the weakly-labeled Web photos), learn classifiers yielding small generalization error on the target domain. 
Our experiments demonstrate that, for the same number of strongly-labeled examples, our domain adaptation approach produces significant recognition rate improvements over the best published results (e.g., 65% better when using 5 labeled training examples per class) and that our classifiers are one order of magnitude faster to learn and to evaluate than the best competing method, despite our use of large weakly-labeled data sets.", "full_text": "Exploiting weakly-labeled Web images to improve\nobject classi\ufb01cation: a domain adaptation approach\n\nAlessandro Bergamo\n\nLorenzo Torresani\n\nComputer Science Department\n\nDartmouth College\n\nHanover, NH 03755, U.S.A.\n\n{aleb, lorenzo}@cs.dartmouth.edu\n\nAbstract\n\nMost current image categorization methods require large collections of man-\nually annotated training examples to learn accurate visual recognition models.\nThe time-consuming human labeling effort effectively limits these approaches to\nrecognition problems involving a small number of different object classes. In or-\nder to address this shortcoming, in recent years several authors have proposed to\nlearn object classi\ufb01ers from weakly-labeled Internet images, such as photos re-\ntrieved by keyword-based image search engines. While this strategy eliminates\nthe need for human supervision, the recognition accuracies of these methods are\nconsiderably lower than those obtained with fully-supervised approaches, because\nof the noisy nature of the labels associated to Web data.\nIn this paper we investigate and compare methods that learn image classi\ufb01ers by\ncombining very few manually annotated examples (e.g., 1-10 images per class)\nand a large number of weakly-labeled Web photos retrieved using keyword-based\nimage search. 
We cast this as a domain adaptation problem: given a few strongly-\nlabeled examples in a target domain (the manually annotated examples) and many\nsource domain examples (the weakly-labeled Web photos), learn classi\ufb01ers yield-\ning small generalization error on the target domain. Our experiments demonstrate\nthat, for the same number of strongly-labeled examples, our domain adaptation\napproach produces signi\ufb01cant recognition rate improvements over the best pub-\nlished results (e.g., 65% better when using 5 labeled training examples per class)\nand that our classi\ufb01ers are one order of magnitude faster to learn and to evaluate\nthan the best competing method, despite our use of large weakly-labeled data sets.\n\n1 Introduction\n\nThe last few years have seen a proliferation of human efforts to collect labeled image data sets\nfor the purpose of training and evaluating visual recognition systems. Label information in these\ncollections comes in different forms, ranging from simple object category labels to detailed semantic\npixel-level segmentations. Examples include Caltech256 [14], and the Pascal VOC2010 data set [7].\nIn order to increase the variety and the number of labeled object classes, a few authors have designed\nonline games and appealing software tools encouraging common users to participate in these image\nannotation efforts [23, 30]. 
Despite the tremendous research contribution brought by such attempts,\neven the largest labeled image collections today [6] are limited to a number of classes that is at least\none order of magnitude smaller than the number of object categories that humans can recognize [3].\nIn order to overcome this limitation and in an attempt to build classi\ufb01ers for arbitrary object classes,\nseveral authors have proposed systems that learn from weakly-labeled Internet photos [10, 9, 29, 20].\nMost of these approaches rely on keyword-based image search engines to retrieve image examples\nof speci\ufb01ed object classes. Unfortunately, while image search engines provide training examples\nwithout the need for any human intervention, it is suf\ufb01cient to type a few example keywords in\nGoogle or Bing image search to verify that often the majority of the retrieved images are only\nloosely related to the query concept. Most prior work has attempted to address this problem\nby means of outlier rejection mechanisms discarding irrelevant images from the retrieved results.\nHowever, despite the dynamic research activity in this area, weakly-supervised approaches today\nstill yield signi\ufb01cantly lower recognition accuracy than fully supervised object classi\ufb01ers trained on\nclean data (see, e.g., results reported in [9, 29]).\n\nIn this paper we argue that the poor performance of models learned from weakly-labeled Internet\ndata is not only due to undetected outliers contaminating the training data, but it is also a conse-\nquence of the statistical differences often present between Web images and the test data. Figure 1\nshows sample images for some of the Caltech256 object categories versus the top six images re-\ntrieved by Bing using the class names as keywords1. 
Although a couple of outliers are indeed\npresent in the Bing sets, the striking difference between the two collections is that even the relevant\nresults in the Bing groups appear to be visually less homogeneous. For example, in the case of the\nclasses shown in \ufb01gure 1(a,b), while the Caltech256 groups contain only real photographs, the Bing\ncounterparts include several cartoon drawings. In \ufb01gure 1(c,d), each Caltech256 image contains\nonly the object of interest while the pictures retrieved by Bing include extraneous items, such as\npeople or faces, which act as distractors in the learning (this is particularly true when evaluating the\nclassi\ufb01ers on Caltech256, given that \u201cfaces\u201d and \u201cpeople\u201d are separate categories in the data set).\nFurthermore, even when \u201cirrelevant\u201d results do occur in the retrieved images, they are rarely outliers\ndetectable via simple coherence tests as there is often some consistency even among such photos.\nFor example, polysemy, the capacity of one word to have multiple meanings, causes multiple\nvisual clusters (as opposed to individual outliers) to appear in the Bing sets of \ufb01gure 1(e,f) (the two\nclusters in (e) are due to the fact that the word \u201chawksbill\u201d denotes both a crag in Arkansas and\na type of sea turtle, while in the case of (f) the keyword \u201ctricycle\u201d retrieves images of both bicycles\nand motorcycles with three wheels; note, again, that Caltech256 contains for both classes only\nimages corresponding to one of the word's meanings and that \u201cmotorcycle\u201d appears as a separate\nadditional category). 
Finally, in some situations, different shooting distances or angles may produce\ncompletely unrelated views of the same object or scene: for example, the Bing set in \ufb01gure 1(g) includes\nboth aerial and ground views of Mars, which have very little in common visually.\n\nNote that for most of the classes in \ufb01gure 1 it is not clear a priori which are the \u201crelevant\u201d Internet\nimages to be used for training until we compare them to the photos in the corresponding Caltech256\ncategories. In this paper we show that a few strongly-labeled examples from the test domain (e.g. a\nfew Caltech256 images for the class of interest) are indeed suf\ufb01cient to disambiguate this relevancy\nproblem and to model the distribution differences between the weakly-labeled Internet data and the\ntest application data, so as to signi\ufb01cantly improve recognition performance on the test set.\n\nThe situation where the test data is drawn from a distribution that is related, but not identical, to\nthe distribution of the training data has been widely studied in the \ufb01eld of machine learning and it\nis traditionally addressed using so-called \u201cdomain adaptation\u201d methods. These techniques exploit\nample availability of training data from a source domain to learn a model that works effectively\nin a related target domain for which only few training examples are available. More formally, let\npt(X, Y ) and ps(X, Y ) be the distributions generating the target and the source data, respectively.\nHere, X denotes the input (a random feature vector) and Y the class (a discrete random variable).\nThe domain adaptation problem arises whenever pt(X, Y ) differs from ps(X, Y ). In covariate\nshift, it is assumed that only the distributions of the input features differ in the two domains, i.e.,\npt(Y |X) = ps(Y |X) but pt(X) \u2260 ps(X). 
Note that, without adaptation, this may lead to poor clas-\nsi\ufb01cation in the target domain since a model learned from a large source training set will be trained\nto perform well in the dense source regions of X which, under the covariate shift assumption,\nwill generally be different from the dense regions of the target domain. Typically, covariate shift\nalgorithms (e.g., [16]) address this problem by modeling the ratio pt(X)/ps(X). Unfortunately, the\nmuch more common and challenging case is when the conditional distributions are different, i.e.,\npt(Y |X) \u2260 ps(Y |X). When such differences are relatively small, however, knowledge gained by\nanalyzing data in the source domain may still yield valuable information to perform prediction for\ntest target data. This is precisely the scenario considered in this paper.\n\n1 Note that image search results may have changed since these examples were captured.\n\nFigure 1: Images in Caltech256 for several categories and top results retrieved by Bing image search\nfor the corresponding keywords. The Bing sets are both semantically and visually less coherent:\npresence of multiple objects in the same image, polysemy, caricaturization, as well as variations in\nviewpoints are some of the visual effects present in Internet images which cause signi\ufb01cant data\ndistribution differences between the Bing sets and the corresponding Caltech256 groups.\n[Image grid: Caltech256 and Bing samples side by side for categories (a) to (g).]\n\n2 Relationship to other methods\n\nMost of the prior work on learning visual models from image search has focused on the task of\n\u201ccleaning up\u201d Internet photos. For example, in the pioneering work of Fergus et al. [10], visual \ufb01lters\nlearned from image search were used to rerank photos on the basis of visual consistency. 
Subsequent\napproaches [2, 25, 20] have employed similar outlier rejection schemes to automatically construct\nclean(er) data sets of images for training and testing object classi\ufb01ers. Even techniques aimed at\nlearning explicit object classi\ufb01ers from image search [9, 29] have identi\ufb01ed outlier removal as the\nkey-ingredient to improve recognition. In our paper we focus on another fundamental, yet largely\nignored, aspect of the problem: we argue that the current poor performance of classi\ufb01cation models\nlearned from the Web is due to the distribution differences between Internet photos and image test\nexamples. To the best of our knowledge we propose the \ufb01rst systematic empirical analysis of domain\nadaptation methods to address sample distribution differences in object categorization due to the use\nof weakly-labeled Web images as training data. We note that in work concurrent to our own, Saenko\net al. [24] have also analyzed cross-domain adaptation of object classi\ufb01ers. However, their work\nfocuses on the statistical differences caused by varying lighting conditions (uncontrolled versus\nstudio setups) and by images taken with different camera types (a digital SLR versus a webcam).\n\nTransfer learning, also known as multi-task learning, is related to domain adaptation. In computer\nvision, transfer learning has been applied to a wide range of problems including object categorization\n(see, e.g., [21, 8, 22]). However, transfer learning addresses a different problem. In transfer learning\nthere is a single distribution of the inputs p(X) but there are multiple output variables Y1, . . . , YT ,\nassociated to T distinct tasks (e.g., learning classi\ufb01ers for different object classes). Typically, it\nis assumed that some relations exist among the tasks; for example, some common structure when\nlearning classi\ufb01ers p(Y1|X, \u03b81), . . . , p(YT |X, \u03b8T ) can be enforced by assuming that the parameters\n\u03b81, . . . 
, \u03b8T are generated from a shared prior p(\u03b8). The fundamental difference is that in domain\nadaptation we have a single task but different domains, i.e., different sources of data.\n\nAs our approach relies on a mix of labeled and weakly-labeled images, it is loosely related to semi-\nsupervised methods for object classi\ufb01cation [15, 19]. Within this genre, the algorithm described\nin [11] is perhaps the closest to our work as it also relies on weakly-labeled Internet images. How-\never, unlike our approach, these semi-supervised methods are designed to work in cases where the\ntest examples and the training data are generated from the same distribution.\n\n3 Approach overview\n\n3.1 Experimental setup\nOur objective is to evaluate domain adaptation methods on the task of object classi\ufb01cation, using\nphotos from a human-labeled data set as target domain examples and images retrieved by a keyword-\nbased image search engine as examples of the source domain.\n\nWe used Caltech256 as the data set for the target domain since it is an established benchmark for\nobject categorization and it contains a large number of classes (256) thus allowing us to average out\nperformance variations due to especially easy or dif\ufb01cult categories. From each class, we randomly\nsampled nt images as target training examples, and another mt images as target test examples.\nWe formed the weakly-labeled source data by collecting the top ns images retrieved by Bing im-\nage search for each of the Caltech256 category text labels. Although it may have been possible to\nimprove the relevancy of the image results for some of the classes by manually selecting less am-\nbiguous search keywords, we chose to issue queries on the unchanged Caltech256 text class labels\nto avoid subjective alteration of the results. 
However, in order to ensure valid testing, we removed\nnear duplicates of Caltech256 images from the source training set by a human-supervised process.\n\n3.2 Feature representation and classi\ufb01cation model\nIn order to study the effect of large weakly-labeled training sets on object recognition performance,\nwe need a baseline system that achieves good performance on object categorization and that supports\nef\ufb01cient learning and test evaluation. The current best published results on Caltech256 were obtained\nby a kernel combination classi\ufb01er using 39 different feature kernels, one for each feature type [13].\nHowever, since both training and testing are computationally very expensive with this classi\ufb01er,\nthis model is unsuitable for our needs.\n\nInstead, in this work we use as image representation the classeme features recently proposed by\nTorresani et al. [28]. This descriptor is particularly suitable for our task as it has been shown to\nyield near state-of-the-art results with simple linear support vector machines, which can be learned\nvery ef\ufb01ciently even for large training sets. The descriptor measures the closeness of an image\nto a basis set of classes and can be used as an intermediate representation to learn classi\ufb01ers for\nnew classes. The basis classi\ufb01ers of the classeme descriptor are learned from weakly-labeled data\ncollected for a large and semantically broad set of attributes (the \ufb01nal descriptor contains 2659\nattributes). To eliminate the risk of the test classes being already explicitly represented in the feature\nvector, in this work we removed from the descriptor 34 attributes, corresponding to categories related\nto Caltech256 classes. 
We use a binarized version of this descriptor obtained by thresholding at 0\nthe output of the attribute classi\ufb01ers: this yields for each image a 2625-dimensional binary vector\ndescribing the predicted presence/absence of visual attributes in the photo. This binarization has\nbeen shown to yield very little degradation in recognition performance (see [28] for further details).\nWe denote with f(x) \u2208 {0, 1}^F the binary attribute vector extracted from image x with F = 2625.\nObject class recognition is traditionally formulated as a multiclass classi\ufb01cation problem: given a\ntest image x, predict the class label y \u2208 {1, . . . , K} of the object present in it, where K is the\nnumber of possible classes (in the case of Caltech256, K = 256). In this paper we implement\nmulti-class classi\ufb01cation using K binary classi\ufb01ers trained using the one-versus-the-rest scheme\nand perform prediction according to the winner-take-all strategy. The k-th binary classi\ufb01er\n(distinguishing between class k and the other classes) is trained on a target training set D^t_k and a\ncollection D^s_k of weakly-labeled source training examples. D^t_k is formed by aggregating the\nCaltech256 training images of all classes, using the data from the k-th class as positive examples and\nthe data from the remaining classes as negative examples, i.e. D^t_k = {(f^t_i, y^t_{i,k})}_{i=1}^{N_t},\nwhere f^t_i = f(x^t_i) denotes the feature vector of the i-th image, N_t = K \u00b7 n_t is the total number\nof images in the strongly-labeled data set, and y^t_{i,k} \u2208 {\u22121, 1} is 1 iff example i belongs to class k.\nThe source training set D^s_k = {f^s_{i,k}}_{i=1}^{n_s} is the collection of n_s images retrieved by Bing\nusing the category name of the k-th class as keyword. As discussed in the next section, different\nmethods will make different assumptions on the labels of the source examples.\n\n
We adopt a linear SVM as the model for the binary one-vs-the-rest classi\ufb01ers. This choice is pri-\nmarily motivated by the availability of several simple yet effective domain adaptation variants of\nSVM [5, 26], in addition to the aforementioned reasons of good performance and ef\ufb01ciency.\n\n4 Methods\n\nWe now present the speci\ufb01c domain adaptation SVM algorithms. For brevity, we drop the subscript\nk indicating dependence on the speci\ufb01c class. The hyperparameters C of all classi\ufb01ers are selected\nso as to minimize the multiclass cross validation error on the target training data. For all algorithms,\nwe cope with the largely unequal number of positive and negative examples by normalizing the cost\nentries in the loss function by the respective class sizes.\n\n4.1 Baselines: SVMs, SVMt, SVMs\u222at\nWe include in our evaluation three algorithms not based on domain adaptation and use them as\ncomparative baselines. We indicate with SVMt a linear SVM learned exclusively from the target\nexamples. SVMs denotes an SVM learned from the source examples using the one-versus-the-rest\nscheme and assuming no outliers are present in the image search results. SVMs\u222at is a linear SVM\ntrained on the union of the target and source examples. Speci\ufb01cally, for each class k, we train a\nbinary SVM on the data obtained by merging D^t_k with D^s_k, where the data in the latter set is assumed\nto contain only positive examples, i.e., no outliers. 
The hyperparameter C is kept the same for all K\nbinary classi\ufb01ers but tuned distinctly for each of the three methods by selecting the hyperparameter\nvalue yielding the best multiclass performance on the target training set (we used hold out validation\non D^t_k for SVMs and 5-fold cross validation for both SVMt and SVMs\u222at).\n\n4.2 Mixture of source and target hypotheses: MIXSVM\nOne of the simplest possible strategies for domain adaptation consists of using as \ufb01nal classi\ufb01er a\nconvex combination of the two SVM hypotheses learned independently from the source and target\ndata. Despite its simplicity, this classi\ufb01er has been shown to yield good empirical results [26].\n\nLet us represent the source and target multiclass hypotheses as vector-valued functions hs(f) \u2192\nR^K, ht(f) \u2192 R^K, where the k-th outputs are the respective SVM scores for class k. MIXSVM\ncomputes a convex combination h(f) = \u03b2 hs(f) + (1 \u2212 \u03b2) ht(f) and predicts the class k^\u2217 associated\nto the largest output, i.e. k^\u2217 = arg max_{k \u2208 {1,...,K}} h_k(f). The parameter \u03b2 \u2208 [0, 1] is determined\nvia grid search by optimizing multiclass error on the target training set. We avoid biased estimates\nresulting from learning the hypothesis ht and \u03b2 on the same training set by applying a two-stage\nprocedure: we learn 5 distinct hypotheses ht using 5-fold cross validation (with the hyperparameter\nvalue found for SVMt) and compute prediction ht(f^t_i) at each training sample f^t_i using the cross\nvalidation hypothesis that was not trained on that example; we then use these predicted outputs to\ndetermine the optimal \u03b2. 
Last, we learn the \ufb01nal hypothesis ht using the entire target training set.\n\n4.3 Domain weighting: DWSVM\nAnother straightforward yet popular domain adaptation approach is to train a classi\ufb01er using both\nthe source and the target examples by weighting differently the two domains in the learning objec-\ntive [5, 12, 4]. We follow the implementation proposed in [26] and weight the loss function values\ndifferently for the source and target examples by using two distinct SVM hyperparameters, C^s and\nC^t, encoding the relative importance of the two domains. The values of these hyperparameters are\nselected by minimizing the multiclass 5-fold cross validation error on the target training set.\n\n4.4 Feature augmentation: AUGSVM\nWe denote with AUGSVM the domain adaptation method described in [5]. The key-idea of this\napproach is to create a feature-augmented version of each individual example f , where distinct\nfeature augmentation mappings \u03c6^s, \u03c6^t are used for the source and target data, respectively:\n\n\u03c6^s(f) = [f^T f^T 0^T]^T   and   \u03c6^t(f) = [f^T 0^T f^T]^T,    (1)\n\nwhere 0 indicates an F-dimensional vector of zeros. A linear SVM is then trained on the union of\nthe feature-augmented source and target examples (using a single hyperparameter). The principle\nbehind this mapping is that the SVM trained in the feature-augmented space has the ability to distin-\nguish features having common behavior in the two domains (associated to the \ufb01rst F SVM weights)\nfrom features having different properties in the two domains.\n\n4.5 Transductive learning: TSVM\nThe previous methods implement different strategies to adjust the relative importance of the source\nand the target examples in the learning process. However, all these techniques assume that the\nsource data is fully and correctly labeled. 
Unfortunately, in our practical problem this assumption\nis violated due to outliers and irrelevant results being present in the images retrieved by keyword\nsearch. To tackle this problem we propose to perform transductive inference on the label of the\nsource data during the learning: the key-idea is to exploit the availability of strongly-labeled target\ntraining data to simultaneously determine the correct labels of the source training examples and\nincorporate this labeling information to improve the classi\ufb01er. To address this task we employ the\ntransductive SVM model introduced in [17]. Although this method is traditionally used to infer\nthe labels of unlabeled data available at learning time, it outputs a proper inductive hypothesis and\ntherefore can be used also to predict labels of unseen test examples. The problem of learning a\ntransductive SVM in our context can be formulated as follows:\n\nmin_{w, y^s}  (1/2) ||w||^2 + C^t \u2211_{i=1}^{N_t} c^t_i l(y^t_i w^T f^t_i) + (C^s / n_s) \u2211_{j=1}^{n_s} l(y^s_j w^T f^s_j)\n\nsubject to  (1/n_s) \u2211_{j=1}^{n_s} max[0, sign(w^T f^s_j)] = \u03c1    (2)\n\nwhere l() denotes the loss function, w is the vector of SVM weights, y^s contains the labels of the\nsource examples, and the c^t_i are scalar coef\ufb01cients used to counterbalance the effect of the unequal\nnumber of positive and negative examples: we set c^t_i = 1/n_t if y^t_i = 1, c^t_i = 1/((K \u2212 1) n_t) other-\nwise. The scalar parameter \u03c1 de\ufb01nes the fraction of source examples that we expect to be positive\nand is tuned via cross validation. 
Note that TSVM solves jointly for the separating hyperplane and\nthe labels of the source examples by trading off maximization of the margin and minimization of the\nprediction errors on both source and target data.\n\nFigure 2: Recognition accuracy obtained with ns = 300 Web photos and a varying number of\nCaltech256 target training examples. [Plot: accuracy (%) versus number of target training images (nt);\ncurves for TSVM, DWSVM, MIXSVM, AUGSVM, SVMs, SVMt, SVMs\u222at.]\n\nFigure 3: Manual annotation saving: the plot shows for a varying number of labeled examples\ngiven to TSVM the number of additional labeled images that would be needed by SVMt to achieve\nthe same accuracy. [Plot: number of additional examples needed by SVMt to match accuracy versus\nnumber of target training examples in TSVM.]\n\nThis optimization can be interpreted as implement-\ning the cluster assumption, i.e., the expectation that points in a data cluster have the same label. We\nsolve the optimization problem in Eq. 2 for a quadratic soft-margin loss function l (i.e., l is chosen to\nbe the square of the hinge loss) using the minimization algorithm proposed in [27], which computes\nan ef\ufb01cient primal solution using the modi\ufb01ed \ufb01nite Newton method of [18]. This minimization\napproach is ideally suited to large-scale sparse data sets such as ours (about 70% of our features are\nzero). We used the same values of hyperparameters (C^t, C^s, and \u03c1) for all classes k = 1, . . . , K\nand selected them by minimizing the multiclass cross validation error. 
We also tried letting \u03c1 vary\nfor each individual class but that led to slightly inferior results, possibly due to over\ufb01tting.\n\n5 Experimental results\n\nWe now present the experimental results. Figure 2 shows the accuracy achieved by the different\nalgorithms when using ns = 300 and a varying number of training target examples (nt). The\naccuracy is measured as the average of the mean recognition rate per class, using mt = 25 test\nexamples for each class. The best accuracy is achieved by the domain adaptation methods TSVM and\nDWSVM, which produce signi\ufb01cant improvements over the SVM trained using only target examples\n(SVMt), particularly for small values of nt. For nt = 5, TSVM yields a 65% improvement over the\nbest published results on this benchmark (for the same number of examples, an accuracy of 16.7% is\nreported in [13]). Our method achieves this performance by analyzing additional images, the Internet\nphotos, but since these are collected automatically and do not require any human supervision, the\ngain we achieve is effectively \u201chuman-cost free\u201d. It is interesting to note that while using solely\nsource training images yields very low accuracy (14.5% for SVMs), adding even just a single labeled\ntarget image produces a signi\ufb01cant improvement (TSVM achieves 18.5% accuracy with nt = 1,\nand 27.1% with nt = 5): this indicates that the method can indeed adapt the classi\ufb01er to work\neffectively on the target domain given a small amount of strongly-labeled data. Note also that,\nwhile TSVM implements a form of outlier rejection as it solves for the labels of the source\nexamples, DWSVM assumes that all source images in D^s_k are positive examples for class k. 
Yet,\nDWSVM achieves results similar to those of TSVM: this suggests that domain adaptation rather than\noutlier rejection is the key-factor contributing to the improvement with respect to the baselines.\n\nBy analyzing the performance of the baselines in \ufb01gure 2 we observe that training exclusively with\nWeb images (SVMs) yields much lower accuracy than using strongly-labeled data (SVMt): this is\nconsistent with prior work [9, 29]. Furthermore, the poor accuracy of SVMs\u222at compared to SVMt\nsuggests that na\u00efvely adding a large number of source examples to the target training set without\nconsideration of the domain differences not only does not help but actually worsens the recognition.\n\nFigure 3 illustrates the signi\ufb01cant manual annotation saving produced by our approach: the x-axis\nis the number of target labeled images provided to TSVM while the y-axis shows the number of\nadditional labeled examples that would be needed by SVMt to achieve the same accuracy.\n\nFigure 4: Classi\ufb01cation accuracy of the different methods using nt = 10 target training images\nand a varying number of source examples. [Plot: accuracy (%) versus number of source training\nimages (ns); curves for TSVM, DWSVM, MIXSVM, AUGSVM, SVMs, SVMt, SVMs\u222at.]\n\nFigure 5: Training time: time needed to learn a multiclass classi\ufb01er for Caltech256 using TSVM.\n[Plot: training time (in minutes) for TSVM versus number of target training images (nt); curves\nfor ns=50 and ns=300.]\n\nThe setting ns = 300 in the results above was chosen by studying the recognition accuracy as\na function of the number of source examples: we carried out an experiment where we \ufb01xed the\nnumber nt 
of target training examples for each category to an intermediate value (nt = 10), and\nvaried the number ns of top image results used as source training examples for each class. Figure 4\nsummarizes the results. We notice that the performance of the SVM trained only on source images\n(SVMs) peaks at ns = 100 and decreases monotonically after this value. This result can be explained\nby observing that image search engines provide images sorted according to estimated relevancy with\nrespect to the keyword. It is reasonable to assume that images far down in the ranking list will often\ntend to be outliers, which may lead to degradation of recognition particularly for non-robust models.\nDespite this, we see that the domain adaptation methods TSVM and DWSVM exhibit a monotonically\nnon-decreasing accuracy as ns grows: this indicates that these methods are highly robust to outliers\nand can make effective use of source data even when increasing ns causes a likely decrease of the\nfraction of inliers and relevant results. Contrast these robust performances with the accuracy of\nSVMs\u222at, which grows as we begin adding source examples but then decays rapidly after ns = 10\nand approaches the poor recognition of SVMs for large values of ns.\nOur approach compares very favorably with competing algorithms also in terms of computational\ncomplexity: training TSVM (without cross validation) on Caltech256 with nt = 5 and ns = 300\ntakes 84 minutes on an AMD Opteron Processor 280 2.4GHz; training the multiclass method of [13]\nusing 5 labeled examples per class takes about 23 hours on the same machine (for fairness of com-\nparison, we excluded cross validation even for this method). A detailed analysis of training time as a\nfunction of the number of labeled training examples is reported in \ufb01gure 5. 
Evaluation of our model\non a test example takes 0.18ms, while the method of [13] requires 37ms.\n\n6 Discussion and future work\n\nIn this work we have investigated the application of domain adaptation methods to object categorization\nusing Web photos as source data. Our analysis indicates that, while object classifiers learned\nexclusively from Web data are inferior to fully-supervised models, the use of domain adaptation\nmethods to combine Web photos with small amounts of strongly labeled data leads to state-of-the-art\nresults. The proposed strategy should be particularly useful in scenarios where labeled data is\nscarce or expensive to acquire. Future work will include application of our approach to combine\ndata from multiple source domains (e.g., images obtained from different search engines or photo\nsharing sites) and different media (e.g., text and video). Additional material including software and\nour source training data may be obtained from [1].\n\nAcknowledgments\n\nWe are grateful to Andrew Fitzgibbon and Martin Szummer for discussion. We thank Vikas Sindhwani\nfor providing code. This research was funded in part by NSF CAREER award IIS-0952943.\n\nReferences\n[1] http://vlg.cs.dartmouth.edu/projects/domainadapt.\n[2] T. L. Berg and D. A. Forsyth. Animals on the web. In CVPR, pages 1463\u20131470, 2006.\n[3] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115\u2013147, 1987.\n[4] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS, 2007.\n[5] H. Daume III. Frustratingly easy domain adaptation. In ACL, 2007.\n[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.\n[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results.\n[8] L. 
Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594\u2013611, 2006.\n[9] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from google\u2019s image search. In ICCV, pages 1816\u20131823, 2005.\n[10] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for google images. In ECCV, 2004.\n[11] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, NIPS 22, 2009.\n[12] J. R. Finkel and C. D. Manning. Hierarchical bayesian domain adaptation. In Proceedings of the North American Association of Computational Linguistics (NAACL 2009), 2009.\n[13] P. V. Gehler and S. Nowozin. On feature combination for multiclass object classification. In IEEE International Conference on Computer Vision (ICCV), 2009.\n[14] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.\n[15] A. Holub, M. Welling, and P. Perona. Exploiting unlabelled data for hybrid object classification. In NIPS, Interclass transfer workshop, 2005.\n[16] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Sch\u00f6lkopf. Correcting sample selection bias by unlabeled data. In NIPS, pages 601\u2013608, 2006.\n[17] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, pages 200\u2013209, 1999.\n[18] S. S. Keerthi and D. DeCoste. A modified finite newton method for fast solution of large scale linear svms. Journal of Machine Learning Research, 6:341\u2013361, 2005.\n[19] C. Leistner, H. Grabner, and H. Bischof. Semi-supervised boosting using visual similarity learning. In CVPR, 2008.\n[20] L. Li and L. Fei-Fei. 
Optimol: Automatic online picture collection via incremental model learning. Intl. Jrnl. of Computer Vision, 88(2):147\u2013168, 2010.\n[21] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.\n[22] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.\n[23] B. C. Russell, A. B. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157\u2013173, 2008.\n[24] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), Sept. 2010.\n[25] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.\n[26] G. Schweikert, C. Widmer, B. Sch\u00f6lkopf, and G. R\u00e4tsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In NIPS, pages 1433\u20131440, 2008.\n[27] V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear svms. In SIGIR, pages 477\u2013484, 2006.\n[28] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In European Conference on Computer Vision (ECCV), pages 776\u2013789, Sept. 2010.\n[29] S. Vijayanarasimhan and K. Grauman. Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In CVPR, 2008.\n[30] L. von Ahn. Games with a purpose. IEEE Computer, 39(6):92\u201394, 2006.\n", "award": [], "sourceid": 93, "authors": [{"given_name": "Alessandro", "family_name": "Bergamo", "institution": null}, {"given_name": "Lorenzo", "family_name": "Torresani", "institution": null}]}