{"title": "Distribution Matching for Transduction", "book": "Advances in Neural Information Processing Systems", "page_first": 1500, "page_last": 1508, "abstract": "Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach.", "full_text": "Distribution Matching for Transduction\n\nNovi Quadrianto\n\nRSISE, ANU & SML, NICTA\n\nCanberra, ACT, Australia\nnovi.quad@gmail.com\n\nJames Petterson\n\nRSISE, ANU & SML, NICTA\n\nCanberra, ACT, Australia\n\njames.petterson@nicta.com.au\n\nAlex J. Smola\nYahoo! Research\n\nSanta Clara, CA, USA\n\nalex@smola.org\n\nAbstract\n\nMany transductive inference algorithms assume that distributions over training\nand test estimates should be related, e.g. by providing a large margin of separation\non both sets. We use this idea to design a transduction algorithm which can be\nused without modification for classification, regression, and structured estimation.\nAt its heart we exploit the fact that for a good learner the distributions over the\noutputs on training and test sets should match. This is a classical two-sample\nproblem which can be solved efficiently in its most general form by using distance\nmeasures in Hilbert Space. It turns out that a number of existing heuristics can be\nviewed as special cases of our approach.\n\n1 Introduction\n\n
Transduction relies on the fundamental assumption that training and test data should exhibit similar\nbehavior. For instance, in large margin classification a popular concept is to assume that both training\nand test data should be separable with a large margin [4]. A similar matching assumption is made\nby [8, 15] in requiring that class means are balanced between training and test set. Corresponding\ndistributional assumptions are made for classification by [5], for regression by [10], and in the\ncontext of sufficient statistics on the marginal polytope by [3, 6].\nSuch matching assumptions are well founded: after all, we assume that both training data X =\n{x_1, . . . , x_m} \u2286 X and test data X' := {x'_1, . . . , x'_{m'}} \u2286 X are drawn independently and identically\ndistributed from the same distribution p(x) on a domain X. It therefore follows that for any function\n(or set of functions) f : X \u2192 R the distribution of f(x) where x \u223c p(x) should also behave in the\nsame way on both training and test set. Note that this is not automatically true if we get to choose f\nafter seeing X and X'.\nRather than indirectly incorporating distributional similarity, e.g. by a large margin heuristic, we\ncast this goal as a two-sample problem which will allow us to draw on a rich body of literature for\ncomparing distributions. One advantage of our setting is its full generality. That is, it is applicable\nwithout much need for customization to all estimation problems, whether structured or not. Furthermore,\nour approach is scalable and can be used easily with online optimization algorithms requiring\nno additional storage and only an additional O(1) computation per observation. This allows us to\nperform a multi-category classification on a dataset with 3.2\u00b710^6 observations. 
At its heart it uses the\nfollowing: rather than minimizing only the empirical risk, regularized risk, log-posterior, or related\nquantities obtained only on the training set, let us add a divergence term characterizing the mismatch\nin distributions between training and test set. We show that the Maximum Mean Discrepancy [7] is\na suitable quantity for this purpose. Moreover, we show that for certain choices of kernels we are\nable to recover a number of existing transduction constraints as special cases.\nNote that our setting is entirely complementary to the notion of modifying the function space due\nto the availability of additional data. The latter stream of research led to the use of graph kernels\nand similar density-related algorithms [1]. It is often referred to as the cluster assumption in\nsemi-supervised learning. In other words, both methods can be combined as needed. That said, while\ndistribution matching always holds, thus making our method always applicable, it is not entirely\nclear whether the cluster assumption is always satisfied (e.g. assume a noisy classification problem).\nDistribution matching, however, comes with a nontrivial price: the objective of the optimization\nproblem ceases to be convex except for rather special cases (which correspond to algorithms that\nhave been proposed as previous work). While this is a downside, it is a property inherent in most\ntransduction algorithms: after all, we are dealing with algorithms to obtain self-consistent labelings,\npredictions, or regressions on the data and there may exist more than one potential solution.\n\n2 The Model\n\nSupervised Learning Denote by X and Y the domains of data and labels and let Pr(x, y) be a\ndistribution on X \u00d7 Y from which we are drawing observations. Moreover, denote by X, Y sets of\ndata and labels of the training set and by X', Y' test data and labels respectively. 
In general, when\ndesigning an estimator one attempts to minimize some regularized risk functional\n\nR_reg[f, X, Y] := (1/m) \u2211_{i=1}^{m} l(x_i, y_i, f) + \u03bb\u2126[f]    (1)\n\nor alternatively (in a Bayesian setting) one deals with a log-posterior probability\n\nlog p(f|X, Y) = \u2211_{i=1}^{m} log p(y_i|x_i, f) + log p(f) + const.    (2)\n\nHere p(f) is the prior of the parameter choice f and p(y_i|x_i, f) denotes the likelihood. f typically\nis a mapping X \u2192 R (for scalar problems such as regression or classification) or X \u2192 R^d (for\nmultivariate problems such as named entity tagging, image annotation, matching, ranking, or more\ngenerally the clique potentials of graphical models). Note that we are free to choose f from one\nof many function classes such as decision trees, neural networks, or (nonparametric) linear models.\nThe specific choice boils down to the ability to control the complexity of f efficiently, to one\u2019s prior\nknowledge of what constitutes a simple function, to runtime constraints, and to the availability of\nscalable algorithms. In general, we will denote the training-data dependent term by\n\nR_train[f, X, Y]    (3)\n\nand we assume that finding some f for which R_train[f, X, Y] is small is desirable. An analogous\nreasoning applies to sampling-based algorithms; however, we skip them for the sake of conciseness.\n\nDistribution Matching Denote by f(X) := {f(x_1), . . . , f(x_m)} and by f(X') :=\n{f(x'_1), . . . , f(x'_{m'})} the applications of our estimator (and any related quantities) to training and\ntest set respectively. For f chosen a-priori, the distributions from which f(X) and f(X') are drawn\ncoincide. Clearly, this should also hold whenever f is chosen by an estimation process. After all,\nwe want that the empirical risk on the training and test sets match. 
While this cannot be checked\ndirectly, we can at least check closeness between the distributions of f(x). This reasoning leads us\nto the following additional term for the objective function of a transduction problem:\n\nD(f(X), f(X'))    (4)\n\nHere D(f(X), f(X')) denotes the distance between the two distributions f(X) and f(X'). This\nleads to an overall objective for learning\n\nR_train[f, X, Y] + \u03b3D(f(X), f(X')) for some \u03b3 > 0    (5)\n\nwhen performing transductive inference. For instance, we could use the Kolmogorov-Smirnov statistic\nbetween both sets as our criterion, that is, we could use\n\nD(f(X), f(X')) = \u2016F(f(X)) \u2212 F(f(X'))\u2016_\u221e    (6)\n\nthe L_\u221e norm between the cumulative distribution functions F associated with the empirical\ndistributions f(X) and f(X') to quantify the differences between both distributions. The problem with\nthe above choice of distance is that it is not easily computable: we first need to evaluate f on both X\nand X', then sort the arguments, and finally compute the largest deviation between both sets before\nwe can even attempt computing gradients or using a similar optimization procedure. Such a choice\nis clearly computationally undesirable.\nInstead, we propose the following: denote by H a Reproducing Kernel Hilbert Space with kernel k\ndefined on X. In this case one can show [7] that whenever k is characteristic (or universal), the map\n\n\u00b5 : p \u2192 \u00b5[p] := E_{x\u223cp(x)}[k(x, \u00b7)] with associated distance D(p, p') := \u2016\u00b5[p] \u2212 \u00b5[p']\u2016^2    (7)\n\ncharacterizes a distribution uniquely. Examples of characteristic kernels are the Gaussian RBF, Laplacian,\nand B_{2n+1}-spline kernels. It is possible to design online estimates of the distance quantity which can be\nused for fast two-sample tests between \u00b5[X] and \u00b5[X']. 
Details on how this can be achieved are\ndeferred to Section 4.\n\n3 Special Cases\n\nBefore discussing a specific algorithm let us consider a number of special cases to show that this\nbasic idea is rather common in the literature (albeit not as explicit as in the present paper).\n\nMean Matching for Classification Joachims [8] uses the following balancing constraint in the\nobjective function of a binary classifier where \u0177(x) = sgn(f(x)) for f(x) = \u27e8w, x\u27e9. In order to\nbalance the outputs between training and test set, [8] imposes the linear constraint\n\n(1/m) \u2211_{i=1}^{m} f(x_i) = (1/m') \u2211_{i=1}^{m'} f(x'_i).    (8)\n\nAssuming a linear kernel k on R this constraint is equivalent to requiring that\n\n\u00b5[f(X)] = (1/m) \u2211_{i=1}^{m} \u27e8f(x_i), \u00b7\u27e9 = (1/m') \u2211_{i=1}^{m'} \u27e8f(x'_i), \u00b7\u27e9 = \u00b5[f(X')].    (9)\n\nNote that [8] uses the margin distribution as an additional criterion which will be discussed later.\nThis setting can be extended to multiclass categorization and estimation with structured random\nvariables in a straightforward fashion [15] simply by requiring a constraint corresponding to (9) to\nbe satisfied for all possible values of y via\n\n(1/m) \u2211_{i=1}^{m} \u27e8f(x_i, y), \u00b7\u27e9 = (1/m') \u2211_{i=1}^{m'} \u27e8f(x'_i, y), \u00b7\u27e9 for all y \u2208 Y.    (10)\n\nThis is equivalent to a linear kernel on R^Y and the requirement that the distributions of the values\nf(x, y) match for all y.\n\nDistribution Matching for Classification G\u00e4rtner et al. 
[5] propose to perform transduction by\nrequiring that the conditional class probabilities on training and test set match. That is, for classifiers\ngenerating a distribution of the form y'_i \u223c p(y'_i|x'_i, w) they require that the marginal class probability\non the test set matches the empirical class probability on the training set. Again, this can be cast in\nterms of distribution matching via\n\n\u00b5[g \u25e6 f(X)] = (1/m) \u2211_{i=1}^{m} \u27e8g \u25e6 f(x_i), \u00b7\u27e9 = (1/m') \u2211_{i=1}^{m'} \u27e8g \u25e6 f(x'_i), \u00b7\u27e9 = \u00b5[g \u25e6 f(X')]\n\nHere g(\u03c7) = 1/(1 + e^{\u2212\u03c7}) denotes the likelihood of y = 1 in logistic regression for the model p(y|\u03c7) =\n1/(1 + e^{\u2212y\u03c7}). Note that instead of choosing the logistic transform g we could have picked a large number\nof other transformations. Indeed, we may strengthen the requirement above to hold for all g in some\ngiven function class G as follows:\n\nD(f(X), f(X')) := sup_{g \u2208 G} [ (1/m) \u2211_{i=1}^{m} g \u25e6 f(x_i) \u2212 (1/m') \u2211_{i=1}^{m'} g \u25e6 f(x'_i) ]    (11)\n\nIf we restrict ourselves to g having bounded norm in a Reproducing Kernel Hilbert Space we obtain\nexactly the criterion (7). Gretton et al. [7] show by duality that this is equivalent to the distance\nproposed in (11). In other words, generalizing distribution matching to apply to transforms other\nthan the logistic leads us directly to our new transduction criterion.\n\nFigure 1: Score distribution of f(x) = \u27e8w, x\u27e9 + b on the \u2018iris\u2019 toy dataset. From left to right:\ninduction scores on the training set and test set; transduction scores on the training set and test set. Note\nthat while the margin distributions on training and test set are very different for induction, the ones\nfor transduction match rather well. 
This results in a 10% reduction of the misclassification error.\n\nDistribution Matching for Regression A similar idea for transduction was proposed by [10] in\nthe context of regression: requiring that both means and predictive variances of the estimate agree\nbetween training and test set. For a heteroscedastic regression estimate this constraint between training\nand test set is met simply by ensuring that the distributions over first and second order moments\nof a Gaussian exponential family distribution match. The same goal can be achieved by using a\npolynomial kernel of second degree on the estimates, which shows that regression transduction can\nbe viewed as a special case.\n\nLarge Margin Hypothesis A key assumption in transduction is that a good hypothesis is characterized\nby a large margin of separation on both training and test set. Typically, the latter is enforced\nby some nonconvex function, e.g. of the form max(0, 1 \u2212 |f(x)|), thus leading to a nonconvex\noptimization problem. Generalizations of this approach to multiclass and structured estimation settings\nare not entirely trivial and require a number of heuristic choices (e.g. how to define the equivalent of\nthe hat function max(0, 1 \u2212 |\u03c7|) that is commonly used in binary transduction).\nInstead, if we require that the distribution of values f(x, \u00b7) on X' matches that on X, we automatically\nobtain a loss function which enforces the large margin hypothesis whenever it is actually\nachievable on the training set. After all, assume that f(X) exhibits a large margin of separation\nwhereas f(X') does not. In this case, D(f(X), f(X')) is large and we obtain better risk minimizers\nby minimizing the discrepancy of the distributions. 
The key point is that by using a two-sample\ncriterion it is possible to obtain such criteria automatically without the need for heuristic choices.\nSee Figure 1 for illustrations of this idea.\n\n4 Algorithm\n\nStreaming Approximation In general, minimizing D(f(X), f(X')) is computationally infeasible\nsince the estimation of the distributional distance requires access to f(X) and f(X') rather than\nevaluations on a small sample. However, for Hilbert-Space based distance measures it is possible to\nfind an online estimate of D as follows [7]:\n\nD(p, p') := \u2016\u00b5[p] \u2212 \u00b5[p']\u2016^2 = \u2016E_{x\u223cp(x)}[k(x, \u00b7)] \u2212 E_{x'\u223cp'(x')}[k(x', \u00b7)]\u2016^2    (12)\n= E_{x,x\u0303\u223cp} E_{x',x\u0303'\u223cp'}[k(x, x\u0303) \u2212 k(x, x\u0303') \u2212 k(x\u0303, x') + k(x', x\u0303')]    (13)\n\nThe tilde (as in x\u0303) denotes a second set of observations drawn from the same distribution. Note that\n(13) decomposes into a sum over 4 kernel functions, each of which takes as arguments a pair of\ninstances drawn from p and p' respectively. Hence we can find an unbiased estimate via\n\nD\u0302 := (1/m) \u2211_{i=1}^{m} D_i where\nD_i := k(f(x_i), f(x_{i+1})) \u2212 k(f(x_i), f(x'_{i+1})) \u2212 k(f(x_{i+1}), f(x'_i)) + k(f(x'_i), f(x'_{i+1}))    (14)\n\nunder the assumption that X and X' contain iid data. Note that the assumption automatically fails\nif there is sequential dependence within the sets X or X' (e.g. we see all positive labels before we\nsee the negative ones). In this case it is necessary to randomize X and X'.\n\nStochastic Gradient Descent The fact that the estimator of the distance D\u0302 decomposes into an\naverage over a function of pairs from the training and test set respectively means that we can use D_i\nas a stochastic approximation. 
Applying the same reasoning to the loss function in the regularized\nrisk (1) we obtain the following loss\n\nl\u0304(x_i, x_{i+1}, y_i, y_{i+1}, x'_i, x'_{i+1}, f) := l(x_i, y_i, f) + l(x_{i+1}, y_{i+1}, f) + 2\u03bb\u2126[f]\n+ \u03b3[k(f(x_i), f(x_{i+1})) \u2212 k(f(x_i), f(x'_{i+1})) \u2212 k(f(x_{i+1}), f(x'_i)) + k(f(x'_i), f(x'_{i+1}))]    (15)\n\nas a stochastic estimate of the objective function defined in (5). This suggests Algorithm 1, which is\na nonconvex variant of [12]. Note that at no time do we need to store past data, even for computing the\ndistance between both distributions.\n\nAlgorithm 1 Stochastic Gradient Descent\n\nInput: Convex set A, objective function l\u0304\nInitialize w = 0\nfor t = 1 to N do\n  Sample (x_i, y_i), (x_{i+1}, y_{i+1}) \u223c p(x, y) and x'_i, x'_{i+1} \u223c p(x)\n  Update w \u2190 w \u2212 \u03b7_t \u2202_w l\u0304(x_i, x_{i+1}, y_i, y_{i+1}, x'_i, x'_{i+1}, f) where f(x) = \u27e8\u03c6(x), w\u27e9\n  Project w onto A via w \u2190 argmin_{w\u0304 \u2208 A} \u2016w \u2212 w\u0304\u2016.\nend for\n\nRemark: The streaming formulation does not impose any in-principle limitation regarding matching\nsample sizes. The only difference is that in the unmatched case we want to give samples from\nboth distributions different weights (1/m and 1/m' respectively), e.g. by modifying the sampling\nprocedure (see Table 3, Section 5).\n\nDC Programming Alternatively, the Concave Convex Procedure, best known as DC programming\nin optimization [2], can be used to find an approximate solution of the problem in (5) by\nsolving a succession of convex programs. DC programming has been used extensively in almost\nall other transductive algorithms to deal with non-convexity of the objective function. 
It works as\nfollows: for a given function F(x) that can be written as a difference of two convex functions G and\nH via F(x) = G(x) \u2212 H(x), the following inequality\n\nF(x) \u2264 F\u0304(x, x_0) := G(x) \u2212 H(x_0) \u2212 \u27e8x \u2212 x_0, \u2202_x H(x_0)\u27e9    (16)\n\nholds for all x_0 with equality for x = x_0, due to the convexity of H(x). This implies an iterative\nalgorithm for finding a local minimum of F by minimizing the upper bound F\u0304(x, x_0) and subsequently\nupdating x_0 \u2190 argmin_x F\u0304(x, x_0) to the minimizer of the upper bound.\nIn order to minimize an additively decomposable objective function as in our transductive estimation,\nwe could use stochastic gradient descent on the convex upper bound. Note that here the convex\nupper bound is given by a sum over the convex upper bounds for all terms. This strategy, however,\nis deficient in a significant aspect: the convex upper bounds on each of the loss terms become\nincreasingly loose as we move f away from the current point of approximation. It would be considerably\nbetter if we updated the upper bound after every stochastic gradient descent step. This\nvariant, however, is identical to stochastic gradient descent on the original objective function due to\nthe following:\n\n\u2202_x F(x)|_{x=x_0} = \u2202_x F\u0304(x, x_0)|_{x=x_0} = \u2202_x G(x)|_{x=x_0} \u2212 \u2202_x H(x)|_{x=x_0} for all x_0.    (17)\n\nIn other words, in order to compute the gradient of the upper bound we need not compute the upper\nbound itself. Instead we may use the nonconvex objective directly; hence we did not pursue the DC\nprogramming approach, and Algorithm 1 applies.\n\n5 Experiments\n\nTo demonstrate the applicability of our approach, we apply transduction to binary and multiclass\nclassification both on toy datasets from the UCI repository [16] and the LibSVM site [17], plus\na larger scale multi-category classification dataset with 3.2 \u00b7 10^6 observations. 
We also perform\nexperiments on structured estimation problems, namely a Japanese named entity recognition task and\nthe CoNLL-2000 base NP chunking task.\n\nAlgorithms Since we are not aware of other transductive algorithms which can be applied easily\nto all the problems we consider, we choose problem-specific transduction algorithms as competitors.\nMulti Switch Transductive SVM (MultiSwitch) is used for binary classification [14]. This method\nis a variant of the transductive SVM algorithm [8] tailored for linear semi-supervised binary classification\non large and sparse datasets and involves switching of more than a single pair of labels at a\ntime. For multiclass categorization we pick a Gaussian process based transductive algorithm with a\ndistribution matching term (GPDistMatch) [5].\nWe use stochastic gradient descent for optimization in both inductive and transductive settings for\nbinary and multiclass losses. More specifically, for transduction we use the Gaussian RBF kernel to\ncompare distributions in (14). Note that, in the multiclass case, the additional distribution matching\nterm measures the distance between multivariate functions.\n\nSmall Scale Experiments We used the following datasets: binary (breastcancer, derm, optdigits,\nwdbc, ionosphere, iris, specft, pageblock, tae, heart, splice, adult, australian, bupa, cmc, german,\npima, tic, yeast, sonar, cleveland, svmguide3 and musk) from the UCI repository and multiclass\n(usps, satimage, segment, svmguide2, vehicle). The data was preprocessed to have zero mean and\nunit variance.\nSince we anticipate the relevant length scale in the margin distribution to be in the order of 1 (after\nall, we use a loss function, i.e. a hinge loss, which uses a margin of 1) we pick a Gaussian RBF\nkernel width of 0.2 for binary classification. Moreover, to take scaling in the number of classes\ninto account we choose a kernel width of 0.1\u221ac for multicategory classification. 
Here c denotes the\nnumber of classes. We could indeed vary this width but we note in our experiments that the proposed\nmethod is not sensitive to this kernel width.\nWe split data equally into training and test sets, performing model selection on the training set and\nassessing performance on the test set. In these small scale experiments, we tune hyperparameters via\n5-fold cross validation on the entire training set. The whole procedure was then repeated 5 times to\nobtain confidence bounds. More specifically, in the model selection stage, for transduction we adjust\nthe regularization \u03bb and the transductive weight term \u03b3 (obviously, for inductive inference we only\nneed to adjust \u03bb). For MultiSwitch Transduction the positive class fraction of unlabeled data was\nestimated using the training set [14]. Likewise, the two associated regularization parameters were\ntuned on the training set. For GP transduction both the regularization and divergence parameters\nwere adjusted.\n\nResults The experimental results are summarized in Figure 2 for a binary setting and in Table\n1 for a multiclass problem. Transduction outperforms the inductive setup on 20 of the 23 binary\ndatasets. Arguably, our proposed transductive method performs on a par with the state-of-the-art\ntransductive approach for each learning problem. In the binary estimation, out of 23 datasets, our\nmethod performs significantly worse than the MultiSwitch transduction algorithm on 4 datasets (adult,\nbupa, pima, and svmguide3) and significantly better on 2 datasets (ionosphere and pageblock), using\na one-sided paired t-test with 95% confidence. Overall, both algorithms are very comparable. The\nadvantage of our approach is that it is \u2018plug and play\u2019, i.e. for different problems we only need\nto use the appropriate supervised loss function. The distribution matching penalty itself remains\nunchanged. 
Further, by casting the transductive solution as an online optimization method, our\napproach scales well.\n\nFigure 2: Error rate on 23 binary estimation problems. Left panel, DistMatch against Induction;\nright panel, DistMatch against MultiSwitch. DistMatch: distribution matching (ours) and\nMultiSwitch: Multi switch transductive SVM [14]. Height of the box encodes standard error\nof DistMatch and width of the box encodes standard error of Induction / MultiSwitch.\n\nTable 1: Error rate \u00b1 standard deviation on a multi-category estimation problem. 
DistMatch:\ndistribution matching (ours) and GPDistMatch: Gaussian Process transduction [5].\n\ndataset | m | classes | Induction | DistMatch | GPDistMatch\nusps | 730 | 10 | 0.143\u00b10.021 | 0.125\u00b10.019 | 0.140\u00b10.034\nsatimage | 620 | 6 | 0.190\u00b10.052 | 0.186\u00b10.037 | 0.212\u00b10.034\nsegment | 693 | 7 | 0.279\u00b10.090 | 0.206\u00b10.047 | 0.181\u00b10.020\nsvmguide2 | 391 | 3 | 0.280\u00b10.028 | 0.256\u00b10.020 | 0.231\u00b10.018\nvehicle | 423 | 4 | 0.385\u00b10.070 | 0.333\u00b10.048 | 0.336\u00b10.060\n\nTable 2: Error rate on the DMOZ ontology for increasing training / test set sizes.\n\ntraining / test set size | 50,000 | 100,000 | 200,000 | 400,000 | 800,000 | 1,600,000\ninduction | 0.365 | 0.362 | 0.337 | 0.299 | 0.300 | 0.268\ntransduction | 0.344 | 0.326 | 0.330 | 0.288 | 0.263 | 0.250\n\nTable 3: Error rate on the DMOZ ontology for fixed training set size of 100,000 samples.\n\ntest set size | 100,000 | 200,000 | 400,000 | 800,000 | 1,600,000\ninduction | 0.358 | 0.358 | 0.357 | 0.357 | 0.357\ntransduction | 0.326 | 0.316 | 0.306 | 0.322 | 0.329\n\nTable 4: Accuracy, precision, recall and F_{\u03b2=1} score on the Japanese named entity task.\n\ninduction\ntransduction\n\nAccuracy\n96.82\n97.13\n\nPrecision Recall\n72.49\n75.30\n\n84.15\n84.46\n\nF1 Score\n77.89\n79.62\n\nTable 5: Accuracy, precision, recall and F_{\u03b2=1} score on the CoNLL-2000 base NP chunking task.\n\ninduction\ntransduction\n\nAccuracy\n95.72\n96.05\n\nPrecision Recall\n90.72\n91.97\n\n90.99\n91.73\n\nF1 Score\n90.85\n91.85\n\nLarger Scale Experiments Since one of the key points of our approach is that it can be applied\nto large problems, we performed transduction on the DMOZ ontology [20] of topics. We selected\nthe top 2 levels of the topic tree (575) and removed all but the 100 most frequent ones, since a\nlarge number of topics occurs only very rarely. This left us with 89.2% of the initial webpages.\nAs feature vectors we used the standard bag of words representation of the web page descriptions\nwith TF-IDF weighting. The dictionary size (and therefore the dimensionality of our features) is\n1,319,489. For these larger scale experiments, we use a dataset of up to 3.2 \u00b7 10^6 observations. To\nour knowledge, our proposed transduction method is the only one that scales very well due to the\nstochastic approximation.\nFor each experiment, we split data into training and test sets. Model selection is performed on the\ntraining set by putting aside part of the training data as a validation set which is then used exclusively\nfor tuning the hyperparameters. 
In large scale transduction two issues matter: firstly, the algorithm\nneeds to be scalable with respect to the training set size. Secondly, we need to be able to scale the\nalgorithm with respect to the test set. Both results can be seen in Tables 2 and 3. Note that Table 2\nuses an equal split between training and test sets, while Table 3 uses an unequal split where the test\nset has many more observations. We see that the algorithm improves with increasing data size, both\nfor training and test sets. In the latter case, only up to some point: for the larger test sets (800,000\nand 1,600,000) performance degrades (although it still stays better than induction). We suspect that a\nlocation-dependent transduction score would be useful in this context, i.e. instead of only minimizing the\ndiscrepancy between decision function values on training and test set D(f(X), f(X')) we could\nalso introduce local features D((X, f(X)), (X', f(X'))).\n\nJapanese Named Entity Recognition Experiments A key advantage of our transduction algorithm\nis that it can be applied to structured estimation without modification. We used the Japanese\nnamed-entity recognition dataset provided with the CRF++ toolkit [18]. The data contains 716\nJapanese sentences with 17 annotated named entities. The task is to detect and classify proper nouns\nand numerical information in a document into categories such as names of persons, organizations,\nlocations, times and quantities. Conditional random fields (CRFs) [9] are considered to be the\nstate-of-the-art framework for this sequential labeling problem [11].\nAs the basis of our implementation we used Leon Bottou\u2019s CRF code [19]. We use simple 1D chain\nCRFs with first order Markov dependency between name tags. That is, we have clique potentials\njoining adjacent labels (y_i, y_{i+1}), but which are independent of the text itself, and clique potentials\njoining words and labels (x_i, y_i). 
Since the former do not depend on the test data there is no need\nto enforce distribution matching. For the latter, though, we want to enforce that clique potentials\nare distributed in the same way between training and test set. The stationarity assumption in the\npotentials implies that this needs to hold uniformly over all such cliques.\nSince the number of tokens per sentence is variable, i.e. the chain length itself is a random variable,\nwe perform distribution matching on a per-token basis; we oversample each token 10 times in our\nexperiments. This strikes a balance between statistical accuracy and computational efficiency. The\nadditional distribution matching term then measures the distance between these over-sampled\nclique potentials. As before, we split data equally into training and test sets and put aside part of\nthe training data as a validation set which is used exclusively for tuning the hyperparameters. We\nrelied on the feature template provided in CRF++ for this task. We report results in Table 4, namely\nprecision (fraction of name tags which match the reference tags), recall (fraction of reference tags\nreturned), and their harmonic mean F_{\u03b2=1}. Transduction outperforms induction in all\nmetrics.\n\nCoNLL-2000 Base NP Chunking Experiments Our second structured estimation experiment uses\nthe CoNLL-2000 base NP chunking dataset [13] as provided in the CRF++ toolkit. The task is to\ndivide text into syntactically correlated parts. The dataset has 900 sentences and the goal is to label\neach word with a label indicating whether the word is outside a chunk, starts a chunk, or continues\na chunk.\nSimilarly to the Japanese named entity recognition task, we use 1D chain CRFs with only first order Markov\ndependency between chunk tags. We considered binary-valued features which depend\non the words, part-of-speech tags, and labels in the neighborhood of a given word as encoded in\nthe CRF++ feature template. 
The same experimental setup as in the named entity experiments is used.\nThe results in terms of accuracy, precision, recall and F1 score are summarized in Table 5. Again,\ntransduction outperforms the inductive setup.\n\n6 Summary and Discussion\n\nWe proposed a transductive estimation algorithm which is a) simple, b) general, c) scalable and d)\nworks well when compared to the state-of-the-art algorithms applied to each specific problem. Not\nonly is it useful for classical binary and multiclass categorization problems but it also applies to\nontologies and structured estimation problems. It is not surprising that it performs very comparably\nto existing algorithms, since they can, in many cases, be seen as special instances of the general\npurpose distribution matching setting.\nExtensions of distribution matching that go beyond modeling f(X) and instead model\n(X, f(X)), that is, the introduction of local features; obtaining good theoretical bounds on the\nshrinkage of the function class via the distribution matching constraint; and applications to other\nfunction classes (e.g. balancing decision trees) are subjects of future research.\n\nReferences\n\n[1] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press,\nCambridge, MA, 2006.\n\n[2] T. Pham Dinh and L. Hoai An. A D.C. optimization algorithm for solving the trust-region\nsubproblem. SIAM Journal on Optimization, 8(2):476\u2013505, 1998.\n\n[3] G. Druck, G.S. Mann, and A. McCallum. Learning from labeled features using generalized\nexpectation criteria. In S.-H. Myaeng, D.W. Oard, F. Sebastiani, T.-S. Chua, and M.-K. Leong,\neditors, SIGIR, pages 595\u2013602. ACM, 2008.\n\n[4] A. Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In Proceedings\nof Uncertainty in AI, pages 148\u2013155, Madison, Wisconsin, 1998.\n\n[5] T. G\u00e4rtner, Q.V. Le, S. Burton, A. J. Smola, and S. V. N. Vishwanathan. Large-scale multiclass\ntransduction. 
In Y. Weiss, B. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information\nProcessing Systems 18, pages 411\u2013418, Cambridge, MA, 2006. MIT Press.\n\n[6] J. Gra\u00e7a, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In\nJ. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, NIPS. MIT Press, 2007.\n\n[7] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the\ntwo sample problem. Technical Report 157, MPI for Biological Cybernetics, 2008.\n\n[8] T. Joachims. Transductive inference for text classification using support vector machines. In\nI. Bratko and S. Dzeroski, editors, Proc. Intl. Conf. Machine Learning, pages 200\u2013209, San\nFrancisco, 1999. Morgan Kaufmann Publishers.\n\n[9] J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic modeling\nfor segmenting and labeling sequence data. In Proc. Intl. Conf. Machine Learning, volume 18,\npages 282\u2013289, San Francisco, CA, 2001. Morgan Kaufmann.\n\n[10] Q.V. Le, A.J. Smola, T. G\u00e4rtner, and Y. Altun. Transductive Gaussian process regression with\nautomatic model selection. In J. F\u00fcrnkranz, T. Scheffer, and M. Spiliopoulou, editors, European\nConference of Machine Learning, volume 4212 of LNAI, pages 306\u2013317, 2006.\n\n[11] A. McCallum and W. Li. Early results for named entity recognition with conditional random\nfields, feature induction and web enhanced lexicons. In CoNLL, 2003.\n\n[12] Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Technical\nReport 2000/13, Universit\u00e9 Catholique de Louvain - Center for Operations Research and\nEconomics, 2000.\n\n[13] E.F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking.\nIn Proc. Conf. Computational Natural Language Learning, pages 127\u2013132, Lisbon, Portugal,\n2000.\n\n[14] V. Sindhwani and S.S. Keerthi. 
Large scale semi-supervised linear SVMs. In SIGIR \u201906:\nProceedings of the 29th annual international ACM SIGIR conference on Research and development\nin information retrieval, pages 477\u2013484, New York, NY, USA, 2006. ACM Press.\n\n[15] A. Zien, U. Brefeld, and T. Scheffer. Transductive support vector machines for structured\nvariables. In ICML, pages 1183\u20131190, 2007.\n\n[16] UCI repository, http://archive.ics.uci.edu/ml/\n\n[17] LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/\n\n[18] CRF++, http://chasen.org/~taku/software/CRF++\n\n[19] Stochastic Gradient Descent code, http://leon.bottou.org/projects/sgd\n\n[20] DMOZ ontology, http://www.dmoz.org\n", "award": [], "sourceid": 523, "authors": [{"given_name": "Novi", "family_name": "Quadrianto", "institution": null}, {"given_name": "James", "family_name": "Petterson", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}