{"title": "Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 1433, "page_last": 1440, "abstract": "When training and test samples follow different input distributions (i.e., the situation called \\emph{covariate shift}), the maximum likelihood estimator is known to lose its consistency. For regaining consistency, the log-likelihood terms need to be weighted according to the \\emph{importance} (i.e., the ratio of test and training input densities). Thus, accurately estimating the importance is one of the key tasks in covariate shift adaptation. A naive approach is to first estimate training and test input densities and then estimate the importance by the ratio of the density estimates. However, since density estimation is a hard problem, this approach tends to perform poorly especially in high dimensional cases. In this paper, we propose a direct importance estimation method that does not require the input density estimates. Our method is equipped with a natural model selection procedure so tuning parameters such as the kernel width can be objectively optimized. This is an advantage over a recently developed method of direct importance estimation. Simulations illustrate the usefulness of our approach.", "full_text": "Direct Importance Estimation with Model Selection\nand Its Application to Covariate Shift Adaptation\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\nsugi@cs.titech.ac.jp\n\nShinichi Nakajima\nNikon Corporation\n\nnakajima.s@nikon.co.jp\n\nHisashi Kashima\n\nIBM Research\n\nPaul von B\u00a8unau\n\nTechnical University Berlin\n\nMotoaki Kawanabe\nFraunhofer FIRST\n\nhkashima@jp.ibm.com\n\nbuenau@cs.tu-berlin.de\n\nnabe@first.fhg.de\n\nAbstract\n\nA situation where training and test samples follow different input distributions is\ncalled covariate shift. 
Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent; weighted variants according to the ratio of test and training input densities regain consistency. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate the training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly since density estimation is a hard task, particularly in high-dimensional cases. In this paper, we propose a direct importance estimation method that does not involve density estimation. Our method is equipped with a natural cross validation procedure, and hence tuning parameters such as the kernel width can be objectively optimized. Simulations illustrate the usefulness of our approach.

1 Introduction

A common assumption in supervised learning is that training and test samples follow the same distribution. However, this basic assumption is often violated in practice, and then standard machine learning methods do not work as desired. A situation where the input distribution P(x) differs between the training and test phases but the conditional distribution of output values, P(y|x), remains unchanged is called covariate shift [8]. In many real-world applications such as robot control [10], bioinformatics [1], spam filtering [3], brain-computer interfacing [9], or econometrics [5], covariate shift is conceivable, and thus learning under covariate shift is gathering a lot of attention these days.

The influence of covariate shift can be alleviated by weighting the log-likelihood terms according to the importance [8]: w(x) = p_te(x) / p_tr(x), where p_te(x) and p_tr(x) are the test and training input densities.
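As a concrete illustration (our own sketch, not code from the paper), an importance-weighted training criterion simply multiplies each per-sample training term by w(x) before summing; here `neg_log_lik` is a hypothetical per-sample negative log-likelihood supplied by the user, and `w` is the (estimated) importance function:

```python
import numpy as np

def weighted_nll(theta, x, y, w, neg_log_lik):
    """Importance-weighted training criterion in the spirit of [8]:
    each training term is weighted by w(x) = p_te(x) / p_tr(x).
    `neg_log_lik(theta, x, y)` is a hypothetical per-sample negative
    log-likelihood; `w(x)` returns importance weights for inputs x."""
    return np.sum(w(x) * neg_log_lik(theta, x, y))
```

Minimizing this weighted criterion instead of the unweighted one is what restores consistency under covariate shift, provided the importance weights are accurate.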
Since the importance is usually unknown, the key issue of covariate shift adaptation is how to estimate it accurately.

A naive approach to importance estimation would be to first estimate the training and test densities separately from the training and test input samples, and then estimate the importance by taking the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases. Therefore, this naive approach may not be effective; directly estimating the importance without estimating the densities would be more promising.

Following this spirit, the kernel mean matching (KMM) method has been proposed recently [6], which directly gives importance estimates without going through density estimation. KMM is shown to work well, provided that tuning parameters such as the kernel width are chosen appropriately. Intuitively, model selection for importance estimation algorithms (such as KMM) could be carried out by cross validation (CV) over the performance of the subsequent learning algorithm. However, this is highly unreliable since the ordinary CV score is heavily biased under covariate shift; for unbiased estimation of the prediction performance of the subsequent learning algorithm, the CV procedure itself needs to be importance-weighted [9]. Since the importance weight must already be fixed when model selection is carried out by importance-weighted CV, it cannot be used for model selection of importance estimation algorithms.

The above fact implies that model selection of importance estimation algorithms should be performed within the importance estimation step in an unsupervised manner.
However, since KMM can only estimate the values of the importance at the training input points, it cannot be directly applied in the CV framework; an out-of-sample extension is needed, but this seems to be an open research issue at present.

In this paper, we propose a new importance estimation method that overcomes the above problems: the proposed method directly estimates the importance without density estimation and is equipped with a natural model selection procedure. Our basic idea is to find an importance estimate ŵ(x) such that the Kullback-Leibler divergence from the true test input density p_te(x) to its estimate p̂_te(x) = ŵ(x) p_tr(x) is minimized. We propose an algorithm that can carry out this minimization without explicitly modeling p_tr(x) and p_te(x). We call the proposed method the Kullback-Leibler Importance Estimation Procedure (KLIEP). The optimization problem involved in KLIEP is convex, so the unique global solution can be obtained. Furthermore, the solution tends to be sparse, which contributes to reducing the computational cost in the test phase.

Since KLIEP is based on the minimization of the Kullback-Leibler divergence, its model selection can be naturally carried out through a variant of likelihood CV, which is a standard model selection technique in density estimation. A key advantage of our CV procedure is that not the training samples but the test input samples are cross-validated.
This highly contributes to improving the model selection accuracy since the number of training samples is typically limited while test input samples are abundantly available.

The simulation studies show that KLIEP tends to outperform existing approaches to importance estimation, including the logistic regression based method [2], and that it contributes to improving the prediction performance in covariate shift scenarios.

2 New Importance Estimation Method

In this section, we propose a new importance estimation method.

2.1 Formulation and Notation

Let D ⊂ R^d be the input domain and suppose we are given i.i.d. training input samples {x^tr_i}_{i=1}^{n_tr} drawn from a training input distribution with density p_tr(x) and i.i.d. test input samples {x^te_j}_{j=1}^{n_te} drawn from a test input distribution with density p_te(x). We assume that p_tr(x) > 0 for all x ∈ D. Typically, the number n_tr of training samples is rather small, while the number n_te of test input samples is very large. The goal of this paper is to develop a method of estimating the importance w(x) from {x^tr_i}_{i=1}^{n_tr} and {x^te_j}_{j=1}^{n_te}:

  w(x) = p_te(x) / p_tr(x).

Our key restriction is that we avoid estimating the densities p_te(x) and p_tr(x) when estimating the importance w(x).

2.2 Kullback-Leibler Importance Estimation Procedure (KLIEP)

Let us model the importance w(x) by the following linear model:

  ŵ(x) = Σ_{ℓ=1}^{b} α_ℓ φ_ℓ(x),   (1)

where {α_ℓ}_{ℓ=1}^{b} are parameters to be learned from data samples and {φ_ℓ(x)}_{ℓ=1}^{b} are basis functions such that

  φ_ℓ(x) ≥ 0 for all x ∈ D and for ℓ = 1, 2, ..., b.

Note that b and {φ_ℓ(x)}_{ℓ=1}^{b} may depend on the samples {x^tr_i}_{i=1}^{n_tr} and {x^te_j}_{j=1}^{n_te}, i.e., kernel models are also allowed; we explain how the basis functions {φ_ℓ(x)}_{ℓ=1}^{b} are chosen in Section 2.3.

Using the model ŵ(x), we can estimate the test input density p_te(x) by

  p̂_te(x) = ŵ(x) p_tr(x).

We determine the parameters {α_ℓ}_{ℓ=1}^{b} in the model (1) so that the Kullback-Leibler divergence from p_te(x) to p̂_te(x) is minimized:

  KL[p_te(x) ‖ p̂_te(x)] = ∫_D p_te(x) log( p_te(x) / (ŵ(x) p_tr(x)) ) dx
                        = ∫_D p_te(x) log( p_te(x) / p_tr(x) ) dx − ∫_D p_te(x) log ŵ(x) dx.

Since the first term in the last expression is independent of {α_ℓ}_{ℓ=1}^{b}, we ignore it and focus on the second term, which we denote by J:

  J = ∫_D p_te(x) log ŵ(x) dx
    ≈ (1/n_te) Σ_{j=1}^{n_te} log ŵ(x^te_j)
    = (1/n_te) Σ_{j=1}^{n_te} log( Σ_{ℓ=1}^{b} α_ℓ φ_ℓ(x^te_j) ),   (2)

where the empirical approximation based on the test input samples {x^te_j}_{j=1}^{n_te} is used from the first line to the second line above. This is our objective function to be maximized with respect to the parameters {α_ℓ}_{ℓ=1}^{b}, and it is concave [4]. Note that the above objective function involves only the test input samples {x^te_j}_{j=1}^{n_te}, i.e., the training input samples {x^tr_i}_{i=1}^{n_tr} have not been used yet. As shown below, {x^tr_i}_{i=1}^{n_tr} will be used in the constraint.

ŵ(x) is an estimate of the importance w(x), which is non-negative by definition.
Therefore, it is natural to impose ŵ(x) ≥ 0 for all x ∈ D, which can be achieved by restricting

  α_ℓ ≥ 0 for ℓ = 1, 2, ..., b.   (3)

In addition to the non-negativity, ŵ(x) should be properly normalized since p̂_te(x) (= ŵ(x) p_tr(x)) is a probability density function:

  1 = ∫_D p̂_te(x) dx = ∫_D ŵ(x) p_tr(x) dx
    ≈ (1/n_tr) Σ_{i=1}^{n_tr} ŵ(x^tr_i)
    = (1/n_tr) Σ_{i=1}^{n_tr} Σ_{ℓ=1}^{b} α_ℓ φ_ℓ(x^tr_i),

where the empirical approximation based on the training input samples {x^tr_i}_{i=1}^{n_tr} is used from the first line to the second line above.

Now our optimization criterion is summarized as follows:

  maximize over {α_ℓ}_{ℓ=1}^{b}:  Σ_{j=1}^{n_te} log( Σ_{ℓ=1}^{b} α_ℓ φ_ℓ(x^te_j) )
  subject to:  Σ_{i=1}^{n_tr} Σ_{ℓ=1}^{b} α_ℓ φ_ℓ(x^tr_i) = n_tr  and  α_1, α_2, ..., α_b ≥ 0.

This is a convex optimization problem and the global solution can be obtained, e.g., by simply performing gradient ascent and feasibility satisfaction iteratively. A pseudo code is described in Figure 1-(a). Note that the solution {α̂_ℓ}_{ℓ=1}^{b} tends to be sparse [4], which contributes to reducing the computational cost in the test phase.
We refer to the above method as the Kullback-Leibler Importance Estimation Procedure (KLIEP).

(a) KLIEP main code
  Input: m = {φ_ℓ(x)}_{ℓ=1}^{b}, {x^tr_i}_{i=1}^{n_tr}, and {x^te_j}_{j=1}^{n_te}
  Output: ŵ(x)
  A_{j,ℓ} ← φ_ℓ(x^te_j);
  b_ℓ ← (1/n_tr) Σ_{i=1}^{n_tr} φ_ℓ(x^tr_i);
  Initialize α (> 0) and ε (0 < ε ≪ 1);
  Repeat until convergence:
    α ← α + ε Aᵀ(1./Aα);
    α ← α + (1 − bᵀα) b / (bᵀb);
    α ← max(0, α);
    α ← α / (bᵀα);
  End
  ŵ(x) ← Σ_{ℓ=1}^{b} α_ℓ φ_ℓ(x);

(b) KLIEP with model selection
  Input: M = { m_k | m_k = {φ^(k)_ℓ(x)}_{ℓ=1}^{b(k)} }, {x^tr_i}_{i=1}^{n_tr}, and {x^te_j}_{j=1}^{n_te}
  Output: ŵ(x)
  Split {x^te_j}_{j=1}^{n_te} into R disjoint subsets {X^te_r}_{r=1}^{R};
  For each model m ∈ M:
    For each split r = 1, ..., R:
      ŵ_r(x) ← KLIEP(m, {x^tr_i}_{i=1}^{n_tr}, {X^te_j}_{j≠r});
      Ĵ_r(m) ← (1/|X^te_r|) Σ_{x ∈ X^te_r} log ŵ_r(x);
    End
    Ĵ(m) ← (1/R) Σ_{r=1}^{R} Ĵ_r(m);
  End
  m̂ ← argmax_{m ∈ M} Ĵ(m);
  ŵ(x) ← KLIEP(m̂, {x^tr_i}_{i=1}^{n_tr}, {x^te_j}_{j=1}^{n_te});

Figure 1: KLIEP algorithm in pseudo code. './' indicates element-wise division and ᵀ denotes the transpose. Inequalities and the 'max' operation for a vector are applied element-wise.

2.3 Model Selection by Likelihood Cross Validation

The performance of KLIEP depends on the choice of the basis functions {φ_ℓ(x)}_{ℓ=1}^{b}. Here we explain how they can be appropriately chosen from data samples.

Since KLIEP is based on the maximization of the score J (see Eq.(2)), it would be natural to select the model such that J is maximized.
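The Figure 1 procedures can be sketched in NumPy as follows. This is our illustrative reading of the pseudo code: variable names and defaults such as `eps` and `n_iter` are our own assumptions, not the authors' implementation.

```python
import numpy as np

def kliep(x_tr, x_te, centers, sigma, eps=1e-4, n_iter=2000):
    """Sketch of the Figure 1-(a) iteration: gradient ascent on the
    objective J plus feasibility satisfaction for the constraints."""
    def phi(x):
        # Gaussian design matrix: phi(x)[i, l] = K_sigma(x_i, centers_l)
        sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))

    A = phi(x_te)                  # A[j, l] = phi_l(x_te_j)
    b = phi(x_tr).mean(axis=0)     # b[l] = (1/n_tr) sum_i phi_l(x_tr_i)
    alpha = np.ones(A.shape[1])
    for _ in range(n_iter):
        alpha += eps * A.T @ (1.0 / (A @ alpha))  # ascend J
        alpha += (1.0 - b @ alpha) * b / (b @ b)  # restore b^T alpha = 1
        alpha = np.maximum(0.0, alpha)            # non-negativity
        alpha /= b @ alpha                        # renormalize
    return lambda x: phi(x) @ alpha               # the estimate w_hat(x)

def kliep_lcv(x_tr, x_te, sigmas, n_centers=100, R=5, seed=0):
    """Sketch of Figure 1-(b): likelihood CV over the *test* samples,
    returning the kernel width maximizing the held-out score J_hat."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x_te)), R)
    idx = rng.choice(len(x_te), min(n_centers, len(x_te)), replace=False)
    centers = x_te[idx]
    scores = {}
    for sigma in sigmas:
        J = 0.0
        for r in range(R):
            held_in = np.concatenate([folds[s] for s in range(R) if s != r])
            w = kliep(x_tr, x_te[held_in], centers, sigma)
            J += np.mean(np.log(w(x_te[folds[r]]))) / R  # J_hat_r, fold-averaged
        scores[sigma] = J
    return max(scores, key=scores.get), scores
```

Note that the final normalization step makes the empirical constraint bᵀα = 1 hold exactly, so the returned estimate averages to one over the training inputs by construction.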
The expectation over p_te(x) involved in J can be numerically approximated by likelihood cross validation (LCV) as follows. First, divide the test samples {x^te_j}_{j=1}^{n_te} into R disjoint subsets {X^te_r}_{r=1}^{R}. Then obtain an importance estimate ŵ_r(x) from {X^te_j}_{j≠r} (together with the training input samples) and approximate the score J using X^te_r as

  Ĵ_r = (1/|X^te_r|) Σ_{x ∈ X^te_r} log ŵ_r(x).   (4)

We repeat this procedure for r = 1, 2, ..., R, compute the average of Ĵ_r over all r, and use the average Ĵ as an estimate of J:

  Ĵ = (1/R) Σ_{r=1}^{R} Ĵ_r.

For model selection, we compute Ĵ for all model candidates (the basis functions {φ_ℓ(x)}_{ℓ=1}^{b} in the current setting) and choose the one that maximizes Ĵ. A pseudo code of the LCV procedure is summarized in Figure 1-(b).

One of the potential limitations of CV in general is that it is not reliable in small sample cases, since data splitting by CV further reduces the sample size. In our CV procedure, on the other hand, the data splitting is performed over the test input samples, not over the training samples. Since we typically have a large number of test input samples, our CV procedure does not suffer from the small sample problem.

A good model may be chosen by the above CV procedure, given that a set of promising model candidates is prepared. As model candidates, we propose using a Gaussian kernel model centered at the test input points {x^te_j}_{j=1}^{n_te}, i.e.,

  ŵ(x) = Σ_{ℓ=1}^{n_te} α_ℓ K_σ(x, x^te_ℓ),

where K_σ(x, x′) is the Gaussian kernel with kernel width σ:

  K_σ(x, x′) = exp( −‖x − x′‖² / (2σ²) ).   (5)

The reason why we chose the test input points {x^te_j}_{j=1}^{n_te} as the Gaussian centers, rather than the training input points {x^tr_i}_{i=1}^{n_tr}, is as follows.
By definition, the importance w(x) tends to take large values if the training input density p_tr(x) is small and the test input density p_te(x) is large; conversely, w(x) tends to be small (i.e., close to zero) if p_tr(x) is large and p_te(x) is small. When a function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels would be enough in the region where the output of the target function is close to zero. Following this heuristic, we decided to allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points {x^te_j}_{j=1}^{n_te}.

Alternatively, we may locate (n_tr + n_te) Gaussian kernels at both {x^tr_i}_{i=1}^{n_tr} and {x^te_j}_{j=1}^{n_te}. However, in our preliminary experiments, this did not further improve the performance but slightly increased the computational cost. Since n_te is typically very large, just using all the test input points {x^te_j}_{j=1}^{n_te} as Gaussian centers is already computationally rather demanding.
To ease this problem, we propose, as a practical matter, using a subset of {x^te_j}_{j=1}^{n_te} as Gaussian centers for computational efficiency, i.e.,

  ŵ(x) = Σ_{ℓ=1}^{b} α_ℓ K_σ(x, c_ℓ),   (6)

where c_ℓ is a template point randomly chosen from {x^te_j}_{j=1}^{n_te} and b (≤ n_te) is a prefixed number. In the rest of this paper, we fix the number of template points at

  b = min(100, n_te),

and optimize the kernel width σ by the above CV procedure.

3 Experiments

In this section, we compare the experimental performance of KLIEP and existing approaches.

3.1 Importance Estimation for Artificial Data Sets

Let p_tr(x) be the d-dimensional Gaussian density with mean (0, 0, ..., 0)ᵀ and identity covariance, and let p_te(x) be the d-dimensional Gaussian density with mean (1, 0, ..., 0)ᵀ and identity covariance. The task is to estimate the importance at the training input points:

  w_i = w(x^tr_i) = p_te(x^tr_i) / p_tr(x^tr_i)  for i = 1, 2, ..., n_tr.

We compare the following methods:

KLIEP(σ): {w_i}_{i=1}^{n_tr} are estimated by KLIEP with the Gaussian kernel model (6). Since the performance of KLIEP depends on the kernel width σ, we test several different values of σ.

KLIEP(CV): The kernel width σ in KLIEP is chosen based on 5-fold LCV (see Section 2.3).

KDE(CV): {w_i}_{i=1}^{n_tr} are estimated through the kernel density estimator (KDE) with the Gaussian kernel. The kernel widths for the training and test densities are chosen separately based on 5-fold likelihood cross validation.

KMM(σ): {w_i}_{i=1}^{n_tr} are estimated by kernel mean matching (KMM) [6]. The performance of KMM depends on tuning parameters such as B, ε, and σ. We set B = 1000 and ε = (√n_tr − 1)/√n_tr following the paper [6], and test several different values of σ.
We used the CPLEX software for solving quadratic programs in the experiments.

LogReg(σ): Importance weights are estimated by logistic regression (LogReg) [2]. Gaussian kernels are used as basis functions. Since the performance of LogReg depends on the kernel width σ, we test several different values of σ. We used the LIBLINEAR implementation of logistic regression for the experiments [7].

LogReg(CV): The kernel width σ in LogReg is chosen based on 5-fold CV.

We fixed the number of test input points at n_te = 1000 and consider the following two settings for the number n_tr of training samples and the input dimension d:

(a) n_tr = 100 and d = 1, 2, ..., 20,
(b) d = 10 and n_tr = 50, 60, ..., 150.

We run the experiments 100 times for each d, each n_tr, and each method, and evaluate the quality of the importance estimates {ŵ_i}_{i=1}^{n_tr} by the normalized mean squared error (NMSE):

  NMSE = (1/n_tr) Σ_{i=1}^{n_tr} ( ŵ_i / Σ_{i'=1}^{n_tr} ŵ_{i'} − w_i / Σ_{i'=1}^{n_tr} w_{i'} )².

[Figure 2: NMSEs averaged over 100 trials, in log scale. (a) When the input dimension is changed (horizontal axis: d, the input dimension); (b) when the training sample size is changed (horizontal axis: n_tr, the number of training samples). Vertical axis: average NMSE over 100 trials (in log scale). Compared methods: KLIEP(0.5), KLIEP(2), KLIEP(7), KLIEP(CV), KDE(CV), KMM(0.1), KMM(1), KMM(10), LogReg(0.5), LogReg(2), LogReg(7), LogReg(CV).]

NMSEs averaged over 100 trials are plotted in log scale in Figure 2. Figure 2(a) shows that the error of KDE(CV) sharply increases as the input dimension grows, while KLIEP, KMM, and LogReg with appropriate kernel widths tend to give smaller errors than KDE(CV). This would be the fruit of directly estimating the importance without going through density estimation. The graph also shows that the performance of KLIEP, KMM, and LogReg depends on the kernel width σ; the results of KLIEP(CV) and LogReg(CV) show that model selection is carried out reasonably well, and KLIEP(CV) works significantly better than LogReg(CV).

Figure 2(b) shows that the errors of all methods tend to decrease as the number of training samples grows. Again, KLIEP, KMM, and LogReg with appropriate kernel widths tend to give smaller errors than KDE(CV). Model selection in KLIEP(CV) and LogReg(CV) works reasonably well, and KLIEP(CV) tends to give significantly smaller errors than LogReg(CV).

Overall, KLIEP(CV) is shown to be a useful method in importance estimation.

3.2 Covariate Shift Adaptation with Regression and Classification Benchmark Data Sets

Here we employ importance estimation methods for covariate shift adaptation in regression and classification benchmark problems (see Table 1).

Each data set consists of input/output samples {(x_k, y_k)}_{k=1}^{n}. We normalize all the input samples {x_k}_{k=1}^{n} into [0, 1]^d and choose the test samples {(x^te_j, y^te_j)}_{j=1}^{n_te} from the pool {(x_k, y_k)}_{k=1}^{n} as follows. We randomly choose one sample (x_k, y_k) from the pool and accept it with probability min(1, 4(x_k^(c))²), where x_k^(c) is the c-th element of x_k and c is randomly determined and fixed in each trial of the experiments; then we remove x_k from the pool regardless of its rejection or acceptance, and repeat this procedure until we accept n_te samples. We choose the training samples {(x^tr_i, y^tr_i)}_{i=1}^{n_tr} uniformly from the rest.
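The NMSE criterion used in Section 3.1 can be written as a small NumPy helper (our own sketch, not the authors' code); both weight vectors are normalized to sum to one before comparison:

```python
import numpy as np

def nmse(w_hat, w_true):
    """Normalized mean squared error between importance estimates and
    true importance values at the training points (Section 3.1)."""
    p = w_hat / w_hat.sum()    # normalized estimates
    q = w_true / w_true.sum()  # normalized true importances
    return np.mean((p - q) ** 2)
```

Because of the normalization, the criterion is invariant to a global rescaling of the estimates, so it measures only the shape of the estimated importance function.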
Intuitively, in this experiment, the test input density tends to be lower than the training input density when x^(c) is small. We set the number of samples at n_tr = 100 and n_te = 500 for all data sets. Note that we use only {(x^tr_i, y^tr_i)}_{i=1}^{n_tr} and {x^te_j}_{j=1}^{n_te} for training regressors or classifiers; the test output values {y^te_j}_{j=1}^{n_te} are used only for evaluating the generalization performance.

We use the following kernel model for regression or classification:

  f̂(x; θ) = Σ_{ℓ=1}^{t} θ_ℓ K_h(x, m_ℓ),

where K_h(x, x′) is the Gaussian kernel (5) with width h and m_ℓ is a template point randomly chosen from {x^te_j}_{j=1}^{n_te}. We set the number of kernels at t = 50. We learn the parameter θ by importance-weighted regularized least squares (IWRLS) [9]:

  θ̂_IWRLS ≡ argmin_θ [ Σ_{i=1}^{n_tr} ŵ(x^tr_i) ( f̂(x^tr_i; θ) − y^tr_i )² + λ‖θ‖² ].   (7)

The solution θ̂_IWRLS is given analytically by

  θ̂ = (KᵀŴK + λI)⁻¹ KᵀŴy,

where I is the identity matrix and

  y = (y_1, y_2, ..., y_{n_tr})ᵀ,   K_{i,ℓ} = K_h(x^tr_i, m_ℓ),   Ŵ = diag(ŵ_1, ŵ_2, ..., ŵ_{n_tr}).

The kernel width h and the regularization parameter λ in IWRLS (7) are chosen by 5-fold importance-weighted CV (IWCV) [9]. We compute the IWCV score by averaging, over the folds {Z^tr_r}_{r=1}^{R},

  (1/|Z^tr_r|) Σ_{(x,y) ∈ Z^tr_r} ŵ(x) L( f̂_r(x), y ),

where f̂_r(x) is the function learned without the r-th fold and the loss L is

  L(ŷ, y) = (ŷ − y)²  (regression),
  L(ŷ, y) = (1/2)(1 − sign{ŷ y})  (classification).

We run the experiments 100 times for each data set and evaluate the mean test error:

  (1/n_te) Σ_{j=1}^{n_te} L( f̂(x^te_j), y^te_j ).

The results are summarized in Table 1, where 'Uniform' denotes uniform weights, i.e., no importance weight is used. The table shows that KLIEP(CV) compares favorably with Uniform, implying that the importance-weighted methods combined with KLIEP(CV) are useful for improving the prediction performance under covariate shift. KLIEP(CV) works much better than KDE(CV); indeed, KDE(CV) tends to be worse than Uniform, which may be due to high dimensionality. We tested 10 different values of the kernel width σ for KMM and report three representative results in the table. KLIEP(CV) is slightly better than KMM with the best kernel width. Finally, LogReg(CV) works reasonably well, but it sometimes performs poorly.

Overall, we conclude that the proposed KLIEP(CV) is a promising method for covariate shift adaptation.

Table 1: Mean test error averaged over 100 trials. The numbers in brackets are the standard deviations. All the error values are normalized so that the mean error by 'Uniform' (uniform weighting, or equivalently no importance weighting) is one. For each data set, the best method and comparable ones based on the Wilcoxon signed rank test at the significance level 5% are described in bold face. The upper half are regression data sets taken from DELVE and the lower half are classification data sets taken from IDA. 'KMM(σ)' denotes KMM with kernel width σ.

  Data      Dim  Uniform     KLIEP(CV)   KDE(CV)     KMM(0.01)   KMM(0.3)    KMM(1)      LogReg(CV)
  kin-8fh    8   1.00(0.34)  0.95(0.31)  1.22(0.52)  1.00(0.34)  1.12(0.37)  1.59(0.53)  1.30(0.40)
  kin-8fm    8   1.00(0.39)  0.86(0.35)  1.12(0.57)  1.00(0.39)  0.98(0.46)  1.95(1.24)  1.29(0.58)
  kin-8nh    8   1.00(0.26)  0.99(0.22)  1.09(0.20)  1.00(0.27)  1.04(0.17)  1.16(0.25)  1.06(0.17)
  kin-8nm    8   1.00(0.30)  0.97(0.25)  1.14(0.26)  1.00(0.30)  1.09(0.23)  1.20(0.22)  1.13(0.25)
  abalone    7   1.00(0.50)  0.94(0.67)  1.02(0.41)  1.01(0.51)  0.96(0.70)  0.93(0.39)  0.92(0.41)
  image     18   1.00(0.51)  0.94(0.44)  0.98(0.45)  0.97(0.50)  0.97(0.45)  1.09(0.54)  0.99(0.48)
  ringnorm  20   1.00(0.04)  0.99(0.06)  0.87(0.04)  1.00(0.04)  0.87(0.05)  0.87(0.05)  0.95(0.08)
  twonorm   20   1.00(0.58)  0.91(0.52)  1.16(0.71)  0.99(0.50)  0.86(0.55)  0.99(0.70)  0.94(0.59)
  waveform  21   1.00(0.45)  0.93(0.34)  1.05(0.47)  1.00(0.44)  0.93(0.32)  0.98(0.31)  0.95(0.34)
  Average        1.00(0.36)  0.94(0.35)  1.07(0.40)  1.00(0.38)  0.98(0.37)  1.20(0.47)  1.06(0.37)

4 Conclusions

In this paper, we addressed the problem of estimating the importance for covariate shift adaptation. The proposed method, called KLIEP, does not involve density estimation, so it is more advantageous than a naive KDE-based approach, particularly in high-dimensional problems. Compared with KMM, which also directly gives importance estimates, KLIEP is practically more useful since it is equipped with a model selection procedure. Our experiments highlighted these advantages, and KLIEP was therefore shown to be a promising method for covariate shift adaptation.

In KLIEP, we modeled the importance function by a linear (or kernel) model, which resulted in a convex optimization problem with a sparse solution. However, our framework allows the use of any model. An interesting future direction to pursue would be to search for a class of models that has additional advantages.

Finally, the range of application of importance weights is not limited to covariate shift adaptation. For example, the density ratio could be used for novelty detection.
Exploring possible application areas will be an important future direction.

Acknowledgments

This work was supported by MEXT (17700142 and 18300057), the Okawa Foundation, the Microsoft CORE3 Project, and the IBM Faculty Award.

References

[1] P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, 1998.
[2] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[3] S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[5] J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-162, 1979.
[6] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 601-608. MIT Press, Cambridge, MA, 2007.
[7] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Technical report, Department of Computer Science, National Taiwan University, 2007.
[8] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[9] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, May 2007.
[10] R. S. Sutton and G. A. Barto. Reinforcement Learning: An Introduction.
MIT Press, Cambridge, MA, 1998.
", "award": [], "sourceid": 232, "authors": [{"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}, {"given_name": "Shinichi", "family_name": "Nakajima", "institution": null}, {"given_name": "Hisashi", "family_name": "Kashima", "institution": null}, {"given_name": "Paul", "family_name": "Buenau", "institution": null}, {"given_name": "Motoaki", "family_name": "Kawanabe", "institution": null}]}