{"title": "Co-Training for Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 2456, "page_last": 2464, "abstract": "Domain adaptation algorithms seek to generalize a model trained in a source domain to a new target domain.  In many  practical cases, the source and target distributions can differ substantially, and in some cases crucial target features may not have support in the source domain.  In this paper we introduce an algorithm that bridges the gap between source and target domains by slowly adding both the target features and instances in which the current  algorithm is the most confident.  Our algorithm is a variant of co-training, and we name it CODA (Co-training for domain adaptation).  Unlike the original co-training work, we do not assume a particular feature split.  Instead, for each iteration of co-training, we add target features and formulate a single optimization problem which simultaneously learns a target predictor, a split of the feature space into views, and a shared subset of source  and target features to include in the predictor.  CODA significantly out-performs the state-of-the-art on the 12-domain benchmark data set of Blitzer et al.. Indeed, over a wide range (65 of 84 comparisons) of target supervision, ranging from no labeled target data to a relatively large number of target labels, CODA achieves the best performance.", "full_text": "Co-Training for Domain Adaptation\n\nMinmin Chen, Kilian Q. Weinberger\n\nDepartment of Computer Science and Engineering\n\nWashington University in St. Louis\n\nSt. Louis, MO 63130\n\nmc15,kilian@wustl.edu\n\nJohn C. Blitzer\nGoogle Research\n\n1600 Amphitheatre Parkway\nMountain View, CA 94043\nblitzer@google.com\n\nAbstract\n\nDomain adaptation algorithms seek to generalize a model trained in a source do-\nmain to a new target domain. 
In many practical cases, the source and target dis-\ntributions can differ substantially, and in some cases crucial target features may\nnot have support in the source domain. In this paper we introduce an algorithm\nthat bridges the gap between source and target domains by slowly adding to the\ntraining set both the target features and instances in which the current algorithm\nis the most con\ufb01dent. Our algorithm is a variant of co-training [7], and we name\nit CODA (Co-training for domain adaptation). Unlike the original co-training\nwork, we do not assume a particular feature split. Instead, for each iteration of co-\ntraining, we formulate a single optimization problem which simultaneously learns\na target predictor, a split of the feature space into views, and a subset of source and\ntarget features to include in the predictor. CODA signi\ufb01cantly out-performs the\nstate-of-the-art on the 12-domain benchmark data set of Blitzer et al. [4]. Indeed,\nover a wide range (65 of 84 comparisons) of target supervision CODA achieves\nthe best performance.\n\n1\n\nIntroduction\n\nDomain adaptation addresses the problem of generalizing from a source distribution for which\nwe have ample labeled training data to a target distribution for which we have little or no la-\nbels [3, 14, 28]. Domain adaptation is of practical importance in many areas of applied machine\nlearning, ranging from computational biology [17] to natural language processing [11, 19] to com-\nputer vision [23].\nIn this work, we focus primarily on domain adaptation problems that are characterized by missing\nfeatures. This is often the case in natural language processing, where different genres often use\nvery different vocabulary to describe similar concepts. For example, in our experiments we use the\nsentiment data of Blitzer et al. [4], where a breeze to use is a way to express positive sentiment about\nkitchen appliances, but not about books. 
In this situation, most domain adaptation algorithms seek\nto eliminate the difference between source and target distributions, either by re-weighting source\ninstances [14, 18] or learning a new feature representation [6, 28].\nWe present an algorithm which differs from both of these approaches. Our method seeks to slowly\nadapt its training set from the source to the target domain, using ideas from co-training. We accom-\nplish this in two ways: First, we train on our own output in rounds, where at each round, we include\nin our training data the target instances we are most con\ufb01dent of. Second, we select a subset of\nshared source and target features based on their compatibility. Different from most previous work\non selecting features for domain adaptation, the compatibility is measured across the training set\nand the unlabeled set, instead of across the two domains. As more target instances are added to the\ntraining set, target speci\ufb01c features become compatible across the two sets, therefore are included\nin the predictor. Finally, we exploit the pseudo multiview co-training algorithm of Chen et al. [10]\n\n1\n\n\fto exploit the unlabeled data ef\ufb01ciently. These three intuitive ideas can be combined in a single\noptimization problem. We name our algorithm CODA (Co-Training for Domain Adaptation).\nBy allowing us to slowly change our training data from source to target, CODA has an advantage\nover representation-learning algorithms [6, 28], since they must decide a priori what the best repre-\nsentation is. In contrast, each iteration of CODA can choose exactly those few target features which\ncan be related to the current (source and pseudo-labeled target) training set. We \ufb01nd that in the\nsentiment prediction data set of Blitzer et al. 
[4] CODA improves the state-of-the-art across widely varying amounts of target labeled data in 65 out of 84 settings.\n\n2 Notation and Setting\n\nWe assume our data originates from two domains, Source (S) and Target (T). The source data is fully labeled, $D_S = \\{(x_1, y_1), \\ldots, (x_{n_s}, y_{n_s})\\} \\subset \\mathbb{R}^d \\times \\mathcal{Y}$, and sampled from some distribution $P_S(X, Y)$. The target data is sampled from $P_T(X, Y)$ and is divided into a labeled part $D_T^l = \\{(x_1, y_1), \\ldots, (x_{n_t}, y_{n_t})\\} \\subset \\mathbb{R}^d \\times \\mathcal{Y}$ and an unlabeled part $D_T^u = \\{(x_1, ?), \\ldots, (x_{m_t}, ?)\\} \\subset \\mathbb{R}^d \\times \\mathcal{Y}$, where in the latter the labels are unknown at training time. Both domains are of equal dimensionality $d$. Our goal is to learn a classifier $h \\in \\mathcal{H}$ that accurately predicts the labels on the unlabeled portion of $D_T$, but also extends to out-of-sample test points, such that for any $(x, y)$ sampled from $P_T$, we have $h(x) = y$ with high probability. For simplicity we assume that $\\mathcal{Y} = \\{+1, -1\\}$, although our method can easily be adapted to multi-class or regression settings.\nWe assume the existence of a base classifier, which determines the set $\\mathcal{H}$. Throughout this paper we simply use logistic regression, i.e. our classifier is parameterized by a weight vector $w \\in \\mathbb{R}^d$ and defined as $h_w(x) = (1 + e^{-w^\\top x})^{-1}$. The weights $w$ are set to minimize the loss function\n\n$$\\ell(w; D) = \\frac{1}{|D|} \\sum_{(x,y) \\in D} \\log(1 + \\exp(-y w^\\top x)). \\qquad (1)$$\n\nIf trained on data sampled from $P_S(X, Y)$, logistic regression models the distribution $P_S(Y|X)$ [13] through $P_h(Y = y|X = x; w) = (1 + e^{-w^\\top x y})^{-1}$. In this paper, our goal is to adapt this classifier to the target distribution $P_T(Y|X)$.\n\n3 Method\n\nIn this section, we begin with a semi-supervised approach and describe the rote-learning procedure to automatically annotate target domain inputs. 
The algorithm maintains and grows a training set that is iteratively adapted to the target domain. We then incorporate feature selection into the optimization, a crucial element of our domain-adaptation algorithm. The feature selection addresses the change in distribution and support from $P_S$ to $P_T$. Further, we introduce pseudo multi-view co-training [7, 10], which improves the rote-learning procedure by adding inputs with features that are still not used effectively by the current classifier. We use automated feature decomposition to artificially split our data into multiple views, explicitly to enable successful co-training.\n\n3.1 Self-training for Domain Adaptation\n\nFirst, we assume we are given a loss function $\\ell$ \u2013 in our case the log-loss from eq. (1) \u2013 which provides some estimate of confidence in its predictions. In logistic regression, if $\\hat{y} = \\mathrm{sign}(h(x))$ is the prediction for an input $x$, the probability $P_h(Y = \\hat{y}|X = x; w)$ is a natural metric of certainty (as $h(x)$ can be interpreted as a probability for $x$ to be of label +1), but other methods [22] can be used. Self-training [19] is a simple and intuitive iterative algorithm to leverage unlabeled data. During training one maintains a labeled training set $L$ and an unlabeled test set $U$, initialized as $L = D_S \\cup D_T^l$ and $U = D_T^u$. In each iteration, a classifier $h_w$ is trained to minimize the loss function $\\ell$ over $L$ and is evaluated on all elements of $U$. The $c$ most confident predictions on $U$ are moved to $L$ for the next iteration, labeled by the prediction $\\mathrm{sign}(h_w)$. The algorithm terminates when $U$ is empty or all predictions are below a pre-defined confidence threshold (and considered unreliable). Algorithm 1 summarizes self-training in pseudo-code with the use of feature selection, described in the following section.\n\nAlgorithm 1 SEDA pseudo-code.\n1: Inputs: L and U.\n2: repeat\n3:   w* = argmin_w \u2113(w; L) + \u03b3 s(L, U, w)\n4:   Apply h_{w*} on all elements of U.\n5:   Move up-to c confident inputs x_i from U to L, labeled as sign(h(x_i)).\n6: until No more predictions are confident\n7: Return h_{w*}\n\n3.2 Feature Selection\n\nSo far, we have not addressed that the two data sets $U$ and $L$ are not sampled from the same distribution. In domain adaptation, the training data is no longer representative of the test data. More explicitly, $P_S(Y|X = x)$ is different from $P_T(Y|X = x)$. For illustration, consider the sentiment analysis problem in section 4, where data consists of unigram and bigram bag-of-words features and the task is to classify whether a book review (source domain) or DVD review (target domain) is positive or negative. Here, the bigram feature \u201cmust read\u201d is indicative of a positive opinion within the source (\u201cbooks\u201d) domain, but rarely appears in the target (\u201cdvd\u201d) domain. A classifier trained on the source-dominated set $L$ that relies too heavily on such features will not make enough high-confidence predictions on the set $U = D_T^u$.\nTo address this issue, we extend the classifier with a weighted $\\ell_1$ regularization for feature selection. The weights are assigned to encourage the classifier to only use features that behave similarly in both $L$ and $U$. 
Different from previous work on feature selection for domain adaptation [25], where the goal is to find a new representation that minimizes the difference between the distributions of the source and target domain, we propose to minimize the difference between the distributions of the labeled training set $L$ and the unlabeled set $U$ (which coincides with the testing set in our setting). This difference is crucial, as it makes the empirical distributions of $L$ and $U$ align gradually. For example, after some iterations, the classifier can pick features that are never present in the source domain, but which have entered $L$ through the rote-learning procedure.\nWe perform the feature selection implicitly through $w$. For a feature $\\alpha$, let us denote the Pearson correlation coefficient (PCC, see footnote 1) between feature value $x_\\alpha$ and the label $y$ for all pairs $(x, y) \\in L$ as $\\rho_L(x_\\alpha, y)$. It can be shown that $\\rho_L(x_\\alpha, y) \\in [-1, 1]$, with a value of +1 if a feature is perfectly aligned with the label (i.e. the feature is the label), 0 if it has no correlation, and -1 if it is of opposite polarity (i.e. the inverted label). Similarly, let us define the PCC for all pairs in $U$ as $\\rho_{U;w}(x_\\alpha, Y)$, where the unknown label $Y$ is a random variable drawn from the conditional probability $P_h(Y|X; w)$. The two PCC values indicate how predictive a feature is of the (estimated) class label in the two respective data sets. Ideally, we would like to choose features that are similarly predictive across the two sets. We measure how similarly a feature behaves across $L$ and $U$ with the product $\\rho_L(x_\\alpha, y)\\,\\rho_{U;w}(x_\\alpha, Y)$. With this notation, we define the feature weight that reflects the cross-domain incompatibility of a feature as\n\n$$\\Delta_{L,U,w}(\\alpha) = 1 - \\rho_L(x_\\alpha, y)\\,\\rho_{U;w}(x_\\alpha, Y). \\qquad (2)$$\n\nIt is straightforward to show that $\\Delta_{L,U,w} \\in [0, 2]$. 
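The incompatibility weight of eq. (2) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the expectation over the random label Y is taken in closed form (E[Y|x] = 2 h_w(x) - 1 and E[Y^2] = 1), which is one plausible reading of how the PCC on U is evaluated. Features with zero variance are assigned a correlation of 0.

```python
import numpy as np

def pearson_labeled(X, y):
    # PCC between every feature column of X and the labels y in {-1, +1}.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)
    denom = X.std(axis=0) * y.std()
    # Assumption: constant features (zero variance) get correlation 0.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

def pearson_expected(X, w):
    # PCC between every feature and a random label Y ~ P_h(Y | x; w).
    # Assumption: expectations over Y are computed in closed form.
    p_pos = 1.0 / (1.0 + np.exp(-X @ w))
    ey = 2.0 * p_pos - 1.0                         # E[Y | x]
    mu_y = ey.mean()
    sigma_y = np.sqrt(max(1.0 - mu_y ** 2, 0.0))   # uses E[Y^2] = 1
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ (ey - mu_y) / X.shape[0]
    denom = X.std(axis=0) * sigma_y
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

def incompatibility(X_L, y_L, X_U, w):
    # Delta(alpha) = 1 - rho_L * rho_{U;w}, one value per feature, in [0, 2].
    return 1.0 - pearson_labeled(X_L, y_L) * pearson_expected(X_U, w)
```

In this sketch a feature that equals the label in both sets receives a weight near 0, an uncorrelated feature receives 1, and a feature whose polarity flips between L and U approaches 2.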
Intuitively, $\\Delta_{L,U,w}$ expresses to what degree we would like to remove a feature. A perfect feature, that is the label itself (and the prediction in $U$), results in a score of 0. A feature that is not correlated with the class label in at least one of the two domains (and therefore is too domain-specific) obtains a score of 1. A feature that switches polarization across domains (and therefore is \u201cmalicious\u201d) has a score $\\Delta_{L,U,w}(\\alpha) > 1$ (in the extreme case, if it is the label in $L$ and the inverted label in $U$, its score would be 2).\n\nFootnote 1: The PCC for two random variables $X, Y$ is defined as $\\rho = \\frac{E[(X - \\mu_X)(Y - \\mu_Y)]}{\\sigma_X \\sigma_Y}$, where $\\mu_X$ denotes the mean and $\\sigma_X$ the standard deviation of $X$.\n\nWe incorporate (2) into a weighted $\\ell_1$ regularization\n\n$$s(L, U, w) = \\sum_{\\alpha=1}^{d} \\Delta_{L,U,w}(\\alpha)\\,|w_\\alpha|. \\qquad (3)$$\n\nIntuitively, (3) encourages feature sparsity with a strong emphasis on features with little or opposite correlation across the domains, whereas good features that are consistently predictive in both domains become cheap. We refer to this version of the algorithm as Self-training for Domain Adaptation (SEDA). The optimization with feature selection, used in Algorithm 1, becomes\n\n$$w = \\mathrm{argmin}_w\\; \\ell(w; L) + \\gamma\\, s(L, U, w). \\qquad (4)$$\n\nHere, $\\gamma \\geq 0$ denotes the loss-regularization trade-off parameter. As we have very few labeled inputs from the target domain in the early iterations, stronger regularization is imposed so that only features shared across the two domains are used. As more inputs from the target domain are included in the training set, we gradually decrease the regularization to accommodate target-specific features. The algorithm is very insensitive to the exact initial choice of $\\gamma$. The guideline is to start with a relatively large number, and decrease it until the selected feature set is not empty. 
In our\nimplementation, we set it to \u03b30 = 0.1, and we divide it by a factor of 1.1 during each iteration.\n\n3.3 Co-training for Domain Adaptation\n\nFor rote-learning to be effective, we need to move test inputs from U to L that 1) are correctly\nclassi\ufb01ed (with high probability) and 2) have potential to improve the classi\ufb01er in future iterations.\nThe former is addressed by the feature selecting regularization from the previous section \u2013 restricting\nthe classi\ufb01er to a sub-set of features that are known to be cross-data set compatible reduces the\ngeneralization error on U. In this section we address the second requirement. We want to add inputs\nxi that contain additional features, which were not used to obtain the prediction hw(xi) and would\nenrich the training set L.\nIf the exact labels of the inputs in U were known, a good active learning [26] strategy would be to\nmove inputs to L on which the current classi\ufb01er hw is most uncertain. In our setting, this would\nbe clearly ill advised as the uncertain prediction is also used as the label. A natural solution to this\ndilemma is co-training [7]. Co-training assumes the data set is presented in two separate views and\ntwo classi\ufb01ers are trained, one in each view. Each iteration, only inputs that are con\ufb01dent according\nto exactly one of the two classi\ufb01ers are moved to the training set. This way, one classi\ufb01er provides\nthe (estimated) labels to the inputs on which the other classi\ufb01er is uncertain.\nIn our setting we do not have multiple views and which features are selected varies in each iteration.\nHence, co-training does not apply out-of-the-box. We can, however, split our features into two mu-\ntually exclusive views such that co-training is effective. To this end we follow the pseudo-multiview\nregularization introduced by Chen et al. [10]. 
The main intuition is to train two classifiers on a single view $X$ such that: (1) both perform well on the labeled data; (2) both are trained on strictly different features; (3) together they are likely to satisfy Balcan\u2019s condition of $\\epsilon$-expandability [2], a necessary and sufficient pre-condition for co-training to work (see footnote 2). These three aspects can be formulated explicitly as three modifications of our optimization problem (4). We discuss each of them in detail in the following.\nLoss. Two classifiers are required for co-training, whose weight vectors we denote by $u$ and $v$. The performance of each classifier is measured by the log-loss $\\ell(\\cdot; L)$ in eq. (1). To ensure that both classifiers perform well on the training set $L$, i.e. both have a small training loss, we train them jointly while minimizing the soft-maximum (see footnote 3) of the two losses,\n\n$$\\log\\left(e^{\\ell(u;L)} + e^{\\ell(v;L)}\\right). \\qquad (5)$$\n\nFeature Decomposition. Co-training requires the two classifiers to be trained on different feature spaces. We create those by splitting the feature space into two mutually exclusive subsets. More precisely, for each feature $\\alpha$, at least one of the two classifiers must have a zero weight in the $\\alpha$th dimension. We can enforce this across all features with the equality constraint\n\n$$\\sum_{\\alpha=1}^{d} u_\\alpha^2 v_\\alpha^2 = 0. \\qquad (6)$$\n\nFootnote 2: Provided that the classifiers are never confident and wrong, which can be violated in practice.\nFootnote 3: The soft-max of a set of elements $S$ is a differentiable approximation of the maximum, $\\max(S) \\approx \\log\\left(\\sum_{s \\in S} e^s\\right)$.\n\n$\\epsilon$-Expandability. In the original co-training formulation [7], it is assumed that the two views of the data are class conditionally independent. This assumption is very strong and can easily be violated in practice [20]. 
Recent work [2] weakens this requirement significantly to a condition of $\\epsilon$-expandability. Loosely phrased, for the two classifiers to be able to teach each other, they must make confident predictions on different subsets of the unlabeled set $U$.\nFor the classifier $h_u$, let $\\hat{y} = \\mathrm{sign}(u^\\top x) \\in \\{\\pm 1\\}$ denote the class prediction and $P_h(\\hat{y}|x; u)$ its confidence. Define $c_u(x)$ as a confidence indicator function (for some confidence threshold $\\tau > 0$, see footnote 4)\n\n$$c_u(x) = \\begin{cases} 1 & \\text{if } P_h(\\hat{y}|x; u) > \\tau \\\\ 0 & \\text{otherwise,} \\end{cases} \\qquad (7)$$\n\nand $c_v$ respectively. Then the $\\epsilon$-expanding condition translates to\n\n$$\\sum_{x \\in U} \\left[c_u(x)\\bar{c}_v(x) + \\bar{c}_u(x)c_v(x)\\right] \\geq \\epsilon \\min\\left[\\sum_{x \\in U} c_u(x)c_v(x),\\; \\sum_{x \\in U} \\bar{c}_u(x)\\bar{c}_v(x)\\right], \\qquad (8)$$\n\nfor some $\\epsilon > 0$. Here, $\\bar{c}_u(x) = 1 - c_u(x)$ indicates that classifier $h_u$ is not confident about input $x$. Intuitively, the constraint in eq. (8) ensures that the number of inputs in $U$ that can be used for rote-learning because exactly one classifier is confident (LHS) is large relative to the number of inputs that cannot be used because both classifiers are already confident or both are not confident (RHS).\nIn summary, the framework splits the feature space into two mutually exclusive subsets. This representation enables us to train two logistic regression classifiers, both with small loss on the labeled data set, while satisfying two constraints to ensure feature decomposition and $\\epsilon$-expandability. Our final classifier has the weight vector $w = u + v$. 
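The two constraints summarized above, the disjoint feature split of eq. (6) and the expansion condition of eq. (8), can be checked numerically for a given pair of weight vectors. The following is a hedged sketch rather than the authors' code: it uses the hard 0/1 indicator of eq. (7) instead of the steep sigmoid mentioned in the paper's footnote, the function names are hypothetical, and the default threshold of 0.8 is the value of tau reported in that footnote.

```python
import numpy as np

def confident(X, weights, tau=0.8):
    # Hard confidence indicator c(x) of eq. (7): the probability of the
    # predicted class, max(p, 1 - p), must exceed the threshold tau.
    p_pos = 1.0 / (1.0 + np.exp(-X @ weights))
    return np.maximum(p_pos, 1.0 - p_pos) > tau

def views_are_disjoint(u, v):
    # Eq. (6): no feature may carry a non-zero weight in both classifiers.
    return bool(np.sum(u ** 2 * v ** 2) == 0)

def expansion_holds(X_U, u, v, eps, tau=0.8):
    # Eq. (8): inputs on which exactly one classifier is confident must
    # number at least eps times the smaller of the both-confident and
    # neither-confident counts.
    cu = confident(X_U, u, tau)
    cv = confident(X_U, v, tau)
    exactly_one = np.sum(cu ^ cv)
    both = np.sum(cu & cv)
    neither = np.sum(~cu & ~cv)
    return bool(exactly_one >= eps * min(both, neither))
```

A check of this kind is only a diagnostic; inside a gradient-based optimizer the hard indicator would have to be replaced by a differentiable surrogate.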
We refer to the resulting algorithm as CODA (Co-training for Domain Adaptation), which can be stated concisely with the following optimization problem:\n\n$$\\begin{aligned} \\min_{w,u,v} \\quad & \\log\\left(e^{\\ell(u;L)} + e^{\\ell(v;L)}\\right) + \\gamma\\, s(L, U, w) \\\\ \\text{subject to:} \\quad & (1)\\; \\sum_{i=1}^{d} u_i^2 v_i^2 = 0 \\\\ & (2)\\; \\sum_{x \\in U} \\left[c_u(x)\\bar{c}_v(x) + \\bar{c}_u(x)c_v(x)\\right] \\geq \\epsilon \\min\\left[\\sum_{x \\in U} c_u(x)c_v(x),\\; \\sum_{x \\in U} \\bar{c}_u(x)\\bar{c}_v(x)\\right] \\\\ & (3)\\; w = u + v \\end{aligned}$$\n\nThe optimization is non-convex. However, as it is not particularly sensitive to initialization, we set $u$, $v$ randomly and optimize with standard conjugate gradient descent (see footnote 5). Due to space constraints we do not include a pseudo-code implementation of CODA. The implementation is essentially identical to that of SEDA (Algorithm 1), where the above optimization problem is solved instead of eq. (4) in line 3. In line 5, we move inputs that one classifier is confident about while the other one is uncertain to the training set $L$ to improve the classifier in future iterations.\n\n4 Results\n\nWe evaluate our algorithm together with several other domain adaptation algorithms on the \u201cAmazon reviews\u201d benchmark data sets [6]. The data set contains reviews of four different types of products: books, DVDs, electronics, and kitchen appliances from Amazon.com. In the original dataset, each review is associated with a rating of 1-5 stars. For simplicity, we are only concerned with whether a review is positive (higher than 3 stars) or negative (3 stars or lower). That is, $y_i \\in \\{+1, -1\\}$, where $y_i = +1$ indicates a positive review, and $-1$ otherwise. The data from four domains results in 12 directed adaptation tasks (e.g. books \u2192 dvds). Each domain adaptation task consists of 2,000 labeled source inputs and around 4,000 unlabeled target test inputs (varying slightly between tasks). We let the amount of labeled target data vary from 0 to 1600. 
For each setting with target labels we ran 10 experiments with different, randomly chosen, labeled instances. The original feature space of unigrams and bigrams has on average approximately 100,000 dimensions across different domains. To reduce the dimensionality, we only use features that appear at least 10 times in a particular domain adaptation task (with approximately 40,000 features remaining). Further, we pre-process the data set with standard tf-idf [24] feature re-weighting.\n\nFootnote 4: In our implementation, the 0-1 indicator was replaced by a very steep differentiable sigmoid function, and $\\tau$ was set to 0.8 across different experiments.\nFootnote 5: We use minimize.m (http://tinyurl.com/minimize-m).\n\nFigure 1: Relative test-error reduction over logistic regression, averaged across all 12 domain adaptation tasks, as a function of the target training set size. Left: A comparison of the three algorithms from section 3. The graph shows clearly that self-training (Self-training vs. Logistic Regression), feature-selection (SEDA vs. Self-training) and co-training (CODA vs. SEDA) each improve the accuracy substantially. Right: A comparison of CODA with four state-of-the-art domain adaptation algorithms. CODA leads to particularly strong improvements under little target supervision.\n\nAs a first experiment, we compare the three algorithms from Section 3 and logistic regression as a baseline. The results are in the left plot of figure 1. For logistic regression, we ignore the difference between source and target distribution, and train a classifier on the union of both labeled data sets. We use $\\ell_2$ regularization, and set the regularization constant with 5-fold cross-validation. In figure 1, all classification errors are shown relative to this baseline. Our second baseline is self-training, which adds self-training to logistic regression \u2013 as described in section 3.1. 
We start with the set\nof labeled instances from source and target domain, and gradually add con\ufb01dent predictions to the\ntraining set from the unlabeled target domain (without regularization). SEDA adds feature selection\nto the self-training procedure, as described in section 3.2. We optimize over 100 iterations of self-\ntraining, at which stage the regularization was effectively zero and the classi\ufb01er converged. For\nCODA we replace self-training with pseudo-multi-view co-training, as described in section 3.3.\nThe left plot in \ufb01gure 1 shows the relative classi\ufb01cation errors of these four algorithms averaged over\nall 12 domain adaptation tasks, under varying amounts of target labels. We observe two trends: First,\nthere are clear gaps between logistic regression, self-training, SEDA, and CODA. From these three\ngaps one can conclude that self-training, feature-selection and co-training each lead to substantial\nimprovements in classi\ufb01cation error. A second trend is that the relative improvement over logistic\nregression reduces as more labeled target data becomes available. This is not surprising, as with\nsuf\ufb01cient target labels the task turns into a classical supervised learning problem and the source data\nbecomes irrelevant.\nAs a second experiment, we compare CODA against three state-of-the-art domain adaptation algo-\nrithms. We refer to these as Coupled, the coupled-subspaces approach [6], EasyAdapt [11], and\nEasyAdapt++. [15]. Details about the respective algorithms are provided in section 5. Coupled\nsubspaces, as described in [6], does not utilize labeled target data and its result is depicted as a\nsingle point. The right plot in \ufb01gure 1 compares these algorithms, relative to logistic regression.\nFigure 3 shows the individual results on all the 12 adaptation tasks with absolute classi\ufb01cation error\nrates. 
The error bars show the standard deviation across the 10 runs with different labeled instances. EasyAdapt and EasyAdapt++ both consistently improve over logistic regression once sufficient target data is available. It is noteworthy that, on average, CODA outperforms the other algorithms in almost all settings when 800 labeled target points or fewer are present. With 1600 labeled target points all algorithms perform similarly to the baseline and additional source data is irrelevant. All hyper-parameters of competing algorithms were carefully set by 5-fold cross validation.\nConcerning computational requirements, it is fair to say that CODA is significantly slower than the other algorithms, as each iteration is of comparable complexity as logistic regression or EasyAdapt. In typical domain adaptation settings this is generally not a problem, as training sets tend to be small. In our experiments, the average training time for CODA (see footnote 6) was about 20 minutes.\n\n[Figure 1 plots. Both panels: x-axis, number of target labeled data (0, 50, 100, 200, 400, 800, 1600); y-axis, relative test error. Left legend: Logistic Regression, Self-training, SEDA, CODA. Right legend: Logistic Regression, Coupled, EasyAdapt, EasyAdapt++, CODA.]\n\nFigure 2: The ratio of the average number of used features between source and target inputs (eq. 9), tracked throughout the CODA optimization. The three plots show the same statistic at different amounts of target labels. Initially, an input from the source domain has on average 10-35% more features that are used by the classifier than a target input. At around iteration 40, this relation changes and the classifier uses more target-typical features. The graph shows the geometric mean across all adaptation tasks. With no target data available (left plot), the early spike in source dominance is more pronounced and decreases when more target labels are available (middle and right plot).\n\nFinally, we investigate the feature-selection process during CODA training. Let us define the indicator function $\\delta(a) \\in \\{0, 1\\}$ to be $\\delta(a) = 0$ if and only if $a = 0$, which operates element-wise on vectors. The vector $\\delta(w) \\in \\{0, 1\\}^d$ indicates which features are used in the classifier and $\\delta(x_i)$ indicates which features are present in input $x_i$. We can denote the ratio between the average number of used features in labeled training inputs over those in unlabeled target inputs as\n\n$$r(w) = \\frac{\\frac{1}{|D_S^l|} \\sum_{x_s \\in D_S^l} \\delta(w)^\\top \\delta(x_s)}{\\frac{1}{|D_T^l|} \\sum_{x_t \\in D_T^l} \\delta(w)^\\top \\delta(x_t)}. \\qquad (9)$$\n\nFigure 2 shows the plot of $r(w)$ for all weight vectors during the 100 iterations of CODA, averaged across all 12 data sets. The three plots show the same statistic under varying amounts of target labels. Two trends can be observed: First, during CODA training, the classifier initially selects more source-specific features. For example, in the case with zero labeled target data, during early iterations the average source input contains 20-35% more used features relative to target inputs. This source-heavy feature distribution changes and eventually turns into a target-heavy distribution as the classifier adapts to the target domain. As a second trend, we observe that with more target labels (right plot), this spike in source features is much less pronounced whereas the final target-heavy ratio is unchanged but starts earlier. 
This indicates that as the target labels increase, the classifier makes less use of the source data and relies sooner and more directly on the target signal.\n\n5 Related Work and Discussion\n\nDomain adaptation algorithms that do not use labeled target domain data are sometimes called unsupervised adaptation algorithms. There are roughly three types of algorithms in this group. The first type, which includes the coupled subspaces algorithm of Blitzer et al. [5], learns a shared representation under which the source and target distributions are closer than under the ambient feature space [28]. The largest disadvantage of these algorithms is that they do not jointly optimize the predictor and the representation, which prevents them from focusing on those features which are both different and predictive. By jointly optimizing the feature selection, the multi-view split and the prediction, CODA allows us to do both.\nThe second type of algorithm attempts to directly minimize the divergence between domains, typically by weighting individual instances [14, 16, 18]. These algorithms do not assume highly divergent domains (e.g. those with unique target features), but they have the advantage over both CODA and representation-learning of learning asymptotically optimal target predictors from only source training data (when their assumptions hold). We did not explore them here because their assumptions are clearly violated for this data set.\n\nFootnote 6: We used a straightforward Matlab\u2122 implementation.\n\n[Figure 2 plots. Three panels (0, 400, and 1600 target labels): x-axis, iterations (20 to 100); y-axis, ratio of used features (source/target), r(w), from 0.9 to 1.4; annotations: Source heavy, Target heavy.]\n\nFigure 3: The individual results on all domain adaptation tasks under varying amounts of labeled target data. The graphs show the absolute classification error rates. All settings with existing labeled target data were averaged over 10 runs (with randomly selected labeled instances). The vertical bars indicate the standard deviation in these cases.\n\nIn natural language processing, a final type of very successful algorithm self-trains on its own target predictions to automatically annotate new target domain features [19]. These methods are most closely related, in spirit, to our own CODA algorithm. Indeed, our self-training baseline is intended to mimic this style of algorithm.\nThe final set of domain adaptation algorithms, which we compared against but did not describe, are those which actively seek to minimize the labeling divergence between domains using multi-task techniques [1, 8, 9, 12, 21, 27]. Most prominently, Daum\u00e9 [11] trains separate source and target models, but regularizes these models to be close to one another. The EasyAdapt++ variant of this algorithm, which we compared against, generalizes this to the semi-supervised setting by making the assumption that for unlabeled target instances, the tasks should be similar. 
Although these methods did not significantly outperform our baselines on the sentiment data set, we note that there do exist data sets on which such multi-task techniques are especially important [11], and we hope soon to explore combinations of CODA with multi-task learning on those data sets.\n\n[Figure 3 panels: test error versus number of target labeled data (0, 50, 100, 200, 400, 800, 1600) for Logistic Regression, Coupled, EasyAdapt, EasyAdapt++, and CODA, one panel per source-target pair: Dvd -> Books, Electronics -> Books, Kitchen -> Books, Books -> Dvd, Electronics -> Dvd, Kitchen -> Dvd, Books -> Electronics, Dvd -> Electronics, Kitchen -> Electronics, Books -> Kitchen, Dvd -> Kitchen, Electronics -> Kitchen.]\n\nReferences\n[1] R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817\u20131853, 2005.\n[2] M.F. Balcan, A. Blum, and K. Yang. Co-training and expansion: Towards bridging theory and practice. NIPS, 17:89\u201396, 2004.\n[3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. A theory of learning from different domains. Machine Learning, 2009.\n[4] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics, Prague, Czech Republic, 2007.\n[5] J. Blitzer, D. Foster, and S. Kakade. Domain adaptation with coupled subspaces. In Conference on Artificial Intelligence and Statistics, Fort Lauderdale, 2011.\n[6] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120\u2013128. Association for Computational Linguistics, 2006.\n[7] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, page 100. ACM, 1998.\n[8] R. Caruana. Multitask learning. Machine Learning, 28:41\u201375, 1997.\n[9] O. Chapelle, P. Shivaswamy, S. Vadrevu, K.Q. Weinberger, Y. Zhang, and B. Tseng. Multi-task learning for boosting with application to web search ranking. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD \u201910, pages 1189\u20131198, New York, NY, USA, 2010. ACM.\n[10] M. Chen, K.Q. 
Weinberger, and Y. Chen. Automatic Feature Decomposition for Single View Co-training. In International Conference on Machine Learning, 2011.\n[11] H. Daum\u00e9 III. Frustratingly easy domain adaptation. In Association for Computational Linguistics, 2007.\n[12] T. Evgeniou, C.A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(1):615, 2006.\n[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, New York, 2009.\n[14] J. Huang, A.J. Smola, A. Gretton, K.M. Borgwardt, and B. Sch\u00f6lkopf. Correcting sample selection bias by unlabeled data. In NIPS 19, pages 601\u2013608. MIT Press, Cambridge, MA, 2007.\n[15] H. Daum\u00e9 III, A. Kumar, and A. Saha. Co-regularization based semi-supervised domain adaptation. In NIPS 23, pages 478\u2013486. MIT Press, 2010.\n[16] J. Jiang and C.X. Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264\u2013271, Prague, Czech Republic, June 2007. Association for Computational Linguistics.\n[17] Q. Liu, A. Mackey, D. Roos, and F. Pereira. Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction. Bioinformatics, 2008.\n[18] T. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS 21, pages 1041\u20131048. MIT Press, 2009.\n[19] D. McClosky, E. Charniak, and M. Johnson. Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 337\u2013344. Association for Computational Linguistics, 2006.\n[20] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. 
In Proceedings of the ninth international conference on Information and knowledge management, pages 86\u201393. ACM, 2000.\n[21] S. Parameswaran and K.Q. Weinberger. Large margin multi-task metric learning. In NIPS 23, pages 1867\u20131875. 2010.\n[22] J.C. Platt et al. Probabilities for SV machines. NIPS, pages 61\u201374, 1999.\n[23] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. Computer Vision\u2013ECCV 2010, pages 213\u2013226, 2010.\n[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513\u2013523, 1988.\n[25] S. Satpal and S. Sarawagi. Domain adaptation of conditional probability models via feature subsetting. Knowledge Discovery in Databases: PKDD 2007, pages 224\u2013235, 2007.\n[26] B. Settles. Active learning literature survey. Machine Learning, 15(2):201\u2013221, 1994.\n[27] K.Q. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113\u20131120. ACM, 2009.\n[28] G. Xue, W. Dai, Q. Yang, and Y. Yu. Topic-bridged PLSA for cross-domain text classification. In SIGIR, 2008.\n", "award": [], "sourceid": 1317, "authors": [{"given_name": "Minmin", "family_name": "Chen", "institution": null}, {"given_name": "Kilian", "family_name": "Weinberger", "institution": null}, {"given_name": "John", "family_name": "Blitzer", "institution": null}]}