{"title": "Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 809, "page_last": 816, "abstract": "We address the problem of estimating the ratio of two probability density functions (a.k.a.~the importance). The importance values can be used for various succeeding tasks such as non-stationarity adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally very efficient and numerically stable. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bound. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches.", "full_text": "Ef\ufb01cient Direct Density Ratio Estimation for\n\nNon-stationarity Adaptation and Outlier Detection\n\nTakafumi Kanamori\nNagoya University\n\nNagoya, Japan\n\nShohei Hido\nIBM Research\n\nKanagawa, Japan\n\nkanamori@is.nagoya-u.ac.jp\n\nhido@jp.ibm.com\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\n\nTokyo, Japan\n\nsugi@cs.titech.ac.jp\n\nAbstract\n\nWe address the problem of estimating the ratio of two probability density functions\n(a.k.a. the importance). The importance values can be used for various succeed-\ning tasks such as non-stationarity adaptation or outlier detection. In this paper, we\npropose a new importance estimation method that has a closed-form solution; the\nleave-one-out cross-validation score can also be computed analytically. Therefore,\nthe proposed method is computationally very ef\ufb01cient and numerically stable. 
We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bound. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches.

1 Introduction

In the context of importance sampling, the ratio of two probability density functions is called the importance. The problem of estimating the importance has been attracting a lot of attention recently since the importance can be used for various succeeding tasks, e.g.,

Covariate shift adaptation: Covariate shift is a situation in supervised learning where the distributions of inputs change between the training and test phases but the conditional distribution of outputs given inputs remains unchanged [8]. Covariate shift is conceivable in many real-world applications such as bioinformatics, brain-computer interfaces, robot control, spam filtering, and econometrics. Under covariate shift, standard learning techniques such as maximum likelihood estimation or cross-validation are biased and therefore unreliable—the bias caused by covariate shift can be compensated by weighting the training samples according to the importance [8, 5, 1, 9].

Outlier detection: The outlier detection task addressed here is to identify irregular samples in an evaluation dataset based on a model dataset that only contains regular samples [7, 3]. The importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one. Thus the values of the importance can be used as an index of the degree of outlyingness.

Below, we refer to the two sets of samples as the training and test sets.
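To make the quantity concrete, here is a minimal sketch (not from the paper) of what the importance is when both densities are known; the two 1-D Gaussians and the query points are illustrative assumptions only.

```python
import numpy as np

def gauss_pdf(x, mu, s):
    # 1-D Gaussian density N(mu, s^2)
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# Illustrative densities: training p_tr = N(0,1), test p_te = N(1,1)
x = np.array([-2.0, 0.0, 0.5, 2.0])          # arbitrary query points
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)
# Importance w(x) = p_te(x)/p_tr(x); for these Gaussians it
# simplifies to exp(x - 1/2): points typical under the test density
# get weights above one, atypical ones get weights below one.
```

In practice, of course, neither density is known, which is exactly the estimation problem addressed below.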
A naive approach to estimating the importance is to first estimate the training and test densities separately from the sets of training and test samples, and then take the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases, unless an appropriate parametric model is available; in practice such a model is often unknown, so this naive approach is not very effective.

To cope with this problem, we propose a direct importance estimation method that does not involve density estimation. The proposed method, which we call least-squares importance fitting (LSIF), is formulated as a convex quadratic program and therefore the unique global solution can be obtained. We give a cross-validation method for model selection and a regularization path tracking algorithm for efficient computation [4].

This regularization path tracking algorithm turns out to be computationally very efficient since the entire solution path can be traced without a quadratic program solver. However, it tends to share a common weakness of path tracking algorithms, i.e., accumulation of numerical errors. To overcome this drawback, we develop an approximation algorithm called unconstrained LSIF (uLSIF), which allows us to obtain a closed-form solution that can be stably computed just by solving a system of linear equations. Thus uLSIF is computationally efficient and numerically stable.
Moreover, the leave-one-out error of uLSIF can also be computed analytically, which further improves the computational efficiency in model selection scenarios.

We experimentally show that the accuracy of uLSIF is comparable to the best existing method while its computation is much faster than the others in covariate shift adaptation and outlier detection.

2 Direct Importance Estimation

Formulation and Notation: Let D ⊂ R^d be the data domain and suppose we are given independent and identically distributed (i.i.d.) training samples {x_i^tr}_{i=1}^{n_tr} from a training distribution with density p_tr(x) and i.i.d. test samples {x_j^te}_{j=1}^{n_te} from a test distribution with density p_te(x). We assume p_tr(x) > 0 for all x ∈ D. The goal of this paper is to estimate the importance

w(x) = p_te(x) / p_tr(x)

from {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}. Our key restriction is that we want to avoid estimating the densities p_te(x) and p_tr(x) when estimating the importance w(x).

Least-squares Approach: Let us model the importance w(x) by the following linear model:

ŵ(x) = α^⊤ φ(x),    (1)

where ⊤ denotes the transpose, α = (α_1, ..., α_b)^⊤ is a parameter to be learned, b is the number of parameters, φ(x) = (φ_1(x), ..., φ_b(x))^⊤ are basis functions such that φ(x) ≥ 0_b for all x ∈ D, 0_b denotes the b-dimensional vector with all zeros, and the inequality for vectors is applied in the element-wise manner. Note that b and {φ_ℓ(x)}_{ℓ=1}^{b} could be dependent on the samples, i.e., kernel models are also allowed. We explain how the basis functions {φ_ℓ(x)}_{ℓ=1}^{b} are chosen later.

We determine the parameter α so that the following squared error is minimized:

J_0(α) = (1/2) ∫ (ŵ(x) − p_te(x)/p_tr(x))^2 p_tr(x) dx = (1/2) ∫ ŵ(x)^2 p_tr(x) dx − ∫ ŵ(x) p_te(x) dx + C,

where C = (1/2) ∫ w(x) p_te(x) dx is a constant and therefore can be safely ignored.
Let\n\n2 R bw(x)2ptr(x)dx \u2212R bw(x)pte(x)dx + C,\n\nptr(x)dx = 1\n\nJ(\u03b1) = J0(\u03b1) \u2212 C = 1\n\n2 \u03b1\u22a4H\u03b1 \u2212 h\u22a4\u03b1,\n\n(2)\n\nwhere H = R \u03d5(x)\u03d5(x)\u22a4ptr(x)dx, h = R \u03d5(x)pte(x)dx. Using the empirical approximation\n\nand taking into account the non-negativity of the importance function w(x), we obtain\n\nmin\u03b1\u2208Rb h 1\ni )\u03d5(xtr\ni=1 \u03d5(xtr\n\n2 \u03b1\u22a4cH\u03b1 \u2212 bh\ni )\u22a4,\n\nb \u03b1i\nj=1 \u03d5(xte\n\nnte Pnte\n\nbh = 1\n\nwhere cH = 1\n\nntr Pntr\n\nfor avoiding over\ufb01tting, \u03bb \u2265 0, and 1b is the b-dimensional vector with all ones.\nThe above problem is a convex quadratic program and therefore the global optimal solution can be\nobtained by a standard software. We call this method Least-Squares Importance Fitting (LSIF).\n\nj ). \u03bb1\u22a4\n\nb \u03b1 is a regularization term\n\n\u22a4\n\n\u03b1 + \u03bb1\u22a4\n\ns.t. \u03b1 \u2265 0b,\n\n(3)\n\n2\n\n\fConvergence Analysis of LSIF: Here, we theoretically analyze the convergence property of the\n\nsolution b\u03b1 of the LSIF algorithm. Let \u03b1\u2217 be the optimal solution of the \u2018ideal\u2019 problem:\n\n2 \u03b1\u22a4H\u03b1 \u2212 h\u22a4\u03b1 + \u03bb1\u22a4\n\ns.t. \u03b1 \u2265 0b.\n\n(4)\n\nmin\u03b1\u2208Rb h 1\n\nb \u03b1i\n\nLet f (n) = \u03c9(g(n)) mean that f (n) asymptotically dominates g(n), i.e., for all C > 0, there exists\nn0 such that |Cg(n)| < |f (n)| for all n > n0. Then we have the following theorem.\n\nTheorem 1 Assume that (a) the optimal solution of the problem (4) satis\ufb01es the strict comple-\nmentarity condition, and (b) ntr and nte satisfy nte = \u03c9(n2\n\ntr). 
Then we have E[J(α̂)] = J(α*) + O(n_tr^{-1}), where E denotes the expectation over all possible training samples of size n_tr and all possible test samples of size n_te.

Theorem 1 guarantees that LSIF converges to the ideal solution with order n_tr^{-1}. It is possible to explicitly obtain the coefficient of the term of order n_tr^{-1}, but we omit the detail due to lack of space.

Model Selection for LSIF: The performance of LSIF depends on the choice of the regularization parameter λ and basis functions {φ_ℓ(x)}_{ℓ=1}^{b} (which we refer to as a model). Since our objective is to minimize the cost function J, it is natural to determine the model such that J is minimized.

Here, we employ cross-validation for estimating J(α̂), which has an accuracy guarantee for finite samples: First, the training samples {x_i^tr}_{i=1}^{n_tr} and test samples {x_j^te}_{j=1}^{n_te} are divided into R disjoint subsets {X_r^tr}_{r=1}^{R} and {X_r^te}_{r=1}^{R}, respectively. Then an importance estimate ŵ_r(x) is obtained using {X_j^tr}_{j≠r} and {X_j^te}_{j≠r}, and the cost J is approximated using the held-out samples X_r^tr and X_r^te as

Ĵ_r^(CV) = (1/(2|X_r^tr|)) Σ_{x^tr ∈ X_r^tr} ŵ_r(x^tr)^2 − (1/|X_r^te|) Σ_{x^te ∈ X_r^te} ŵ_r(x^te).

This procedure is repeated for r = 1, ..., R and its average Ĵ^(CV) is used as an estimate of J. We can show that Ĵ^(CV) gives an almost unbiased estimate of the true cost J, where the 'almost'-ness comes from the fact that the number of samples is reduced due to data splitting.

Heuristics of Basis Function Design: A good model may be chosen by cross-validation, given that a family of promising model candidates is prepared.
As model candidates, we propose using a Gaussian kernel model centered at the test input points {x_j^te}_{j=1}^{n_te}, i.e.,

ŵ(x) = Σ_{ℓ=1}^{n_te} α_ℓ K_σ(x, x_ℓ^te),  where K_σ(x, x′) = exp(−‖x − x′‖^2 / (2σ^2)).    (5)

The reason why we chose the test input points {x_j^te}_{j=1}^{n_te} as the Gaussian centers, not the training input points {x_i^tr}_{i=1}^{n_tr}, is as follows. By definition, the importance w(x) tends to take large values if the training input density p_tr(x) is small and the test input density p_te(x) is large; conversely, w(x) tends to be small (i.e., close to zero) if p_tr(x) is large and p_te(x) is small. When a function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels would be enough in the region where the output of the target function is close to zero. Following this heuristic, we allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points {x_j^te}_{j=1}^{n_te}.

Alternatively, we may locate (n_tr + n_te) Gaussian kernels at both {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}. However, in our preliminary experiments, this did not further improve the performance, but just slightly increased the computational cost. When n_te is large, just using all the test input points {x_j^te}_{j=1}^{n_te} as Gaussian centers is already computationally rather demanding.
To ease this problem, we practically propose using a subset of {x_j^te}_{j=1}^{n_te} as Gaussian centers for computational efficiency, i.e.,

ŵ(x) = Σ_{ℓ=1}^{b} α_ℓ K_σ(x, c_ℓ),    (6)

where c_ℓ is a template point randomly chosen from {x_j^te}_{j=1}^{n_te} and b (≤ n_te) is a prefixed number. In the experiments shown later, we fix the number of template points at b = min(100, n_te), and optimize the kernel width σ and the regularization parameter λ by cross-validation with grid search.

Entire Regularization Path for LSIF: We can show that the LSIF solution α̂ is piecewise linear with respect to the regularization parameter λ. Therefore, the regularization path (i.e., the solutions for all λ) can be computed efficiently based on the parametric optimization technique [4]. A basic idea of regularization path tracking is to check the violation of the Karush-Kuhn-Tucker (KKT) conditions—which are necessary and sufficient conditions for optimality of convex programs—when the regularization parameter λ is changed. Although the detail of the algorithm is omitted due to lack of space, we can show that a quadratic programming solver is no longer needed for obtaining the entire solution path of LSIF—just computing matrix inverses is enough. This highly contributes to saving the computation time. However, in our preliminary experiments, the regularization path tracking algorithm turned out to be numerically rather unreliable since numerical errors tend to accumulate when tracking the regularization path. This seems to be a common pitfall of solution path tracking algorithms in general.

3 Approximation Algorithm

Unconstrained Least-squares Approach: The approximation idea we introduce here is very simple: we ignore the non-negativity constraint of the parameters in the optimization problem (3).
Thus

min_{β∈R^b} [ (1/2) β^⊤ Ĥ β − ĥ^⊤ β + (λ/2) β^⊤ β ].    (7)

In the above, we included a quadratic regularization term λ β^⊤ β / 2, instead of the linear one λ 1_b^⊤ α, since the linear penalty term does not work as a regularizer without the non-negativity constraint. Eq.(7) is an unconstrained convex quadratic program, so the solution can be analytically computed. However, since we dropped the non-negativity constraint β ≥ 0_b, some of the learned parameters could be negative. To compensate for this approximation error, we modify the solution by

β̂ = max(0_b, β̃),  β̃ = (Ĥ + λ I_b)^{-1} ĥ,    (8)

where I_b is the b-dimensional identity matrix and the 'max' operation for vectors is applied in the element-wise manner. This is the solution of the approximation method we propose in this section.

An advantage of the above unconstrained formulation is that the solution can be computed just by solving a system of linear equations. Therefore, the computation is fast and stable. We call this method unconstrained LSIF (uLSIF). Due to the ℓ2 regularizer, the solution tends to be close to 0_b to some extent. Thus, the effect of ignoring the non-negativity constraint may not be so strong. Below, we theoretically analyze the approximation error of uLSIF.

Convergence Analysis of uLSIF: Here, we theoretically analyze the convergence property of the solution β̂ of the uLSIF algorithm. Let β* be the optimal solution of the 'ideal' problem: β* = max(0_b, β°), where β° = argmin_{β∈R^b} [ (1/2) β^⊤ H β − h^⊤ β + (λ/2) β^⊤ β ]. Then we have

Theorem 2 Assume that (a) β°_ℓ ≠ 0 for ℓ = 1, ..., b, and (b) n_tr and n_te satisfy n_te = ω(n_tr^2). Then we have E[J(β̂)] = J(β*) + O(n_tr^{-1}).

Theorem 2 guarantees that uLSIF converges to the ideal solution with order n_tr^{-1}. It is possible to explicitly obtain the coefficient of the term of order n_tr^{-1}, but we omit the detail due to lack of space. We can also derive upper bounds on the difference between LSIF and uLSIF and show that uLSIF gives a good approximation to LSIF. However, we do not go into the detail due to space limitation.

Efficient Computation of Leave-one-out Cross-validation Score: Another practically very important advantage of uLSIF is that the score of leave-one-out cross-validation (LOOCV) can also be computed analytically—thanks to this property, the computational complexity for performing LOOCV is of the same order as just computing a single solution. In the current setting, we are given two sets of samples, {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}, which generally have different sample sizes. For simplicity, we assume that n_tr < n_te and that the i-th training sample x_i^tr and the i-th test sample x_i^te are held out at the same time; the test samples {x_j^te}_{j=n_tr+1}^{n_te} are always used for importance estimation.

Let β̂_λ^(i) be a parameter learned without the i-th training sample x_i^tr and the i-th test sample x_i^te. Then the LOOCV score is expressed as

(1/n_tr) Σ_{i=1}^{n_tr} [ (1/2) (φ(x_i^tr)^⊤ β̂_λ^(i))^2 − φ(x_i^te)^⊤ β̂_λ^(i) ].

Our approach to efficiently computing the LOOCV score is to use the Sherman-Woodbury-Morrison formula for computing matrix inverses—β̂_λ^(i) can be expressed as

β̂_λ^(i) = max{0_b, ((n_tr−1) n_te)/(n_tr (n_te−1)) (a + (a^⊤ φ(x_i^tr))/(n_tr − φ(x_i^tr)^⊤ a_tr) · a_tr) − ((n_tr−1))/(n_tr (n_te−1)) (a_te + (a_te^⊤ φ(x_i^tr))/(n_tr − φ(x_i^tr)^⊤ a_tr) · a_tr)},

where a = A^{-1} ĥ, a_tr = A^{-1} φ(x_i^tr), a_te = A^{-1} φ(x_i^te), and A = Ĥ + ((n_tr−1)λ/n_tr) I_b. This implies that the matrix inverse needs to be computed only once (i.e., A^{-1}) for calculating the LOOCV scores. Thus LOOCV can be carried out very efficiently without repeating hold-out loops.

4 Relation to Existing Methods

Kernel density estimation (KDE) is a non-parametric technique to estimate a probability density function. KDE can be used for importance estimation by first estimating p̂_tr(x) and p̂_te(x) separately from {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} and then estimating the importance by ŵ(x) = p̂_te(x)/p̂_tr(x). KDE is efficient in computation since no optimization is involved, and model selection is possible by likelihood cross-validation. However, KDE may suffer from the curse of dimensionality.

The kernel mean matching (KMM) method allows us to directly obtain an estimate of the importance values at the training points without going through density estimation [5]. KMM can overcome the curse of dimensionality by directly estimating the importance using a special property of the Gaussian reproducing kernel Hilbert space. However, there is no objective model selection method for the regularization parameter and the kernel width. As for the regularization parameter, we may follow a suggestion in the original paper, which is justified by a theoretical argument to some extent [5]. As for the Gaussian width, we may adopt the popular heuristic of using the median distance between samples, although there seems to be no strong justification for this.
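For concreteness, the median-distance heuristic just mentioned can be sketched as follows; this is a generic sketch of the common heuristic, not code from the paper, and the sample data are arbitrary.

```python
import numpy as np

def median_distance(x):
    # Median of pairwise Euclidean distances among samples x (shape (n, d)).
    # A common heuristic choice for the Gaussian kernel width sigma.
    diff = x[:, None, :] - x[None, :, :]           # (n, n, d) differences
    dist = np.sqrt((diff ** 2).sum(axis=2))        # (n, n) distance matrix
    iu = np.triu_indices(len(x), k=1)              # distinct pairs only
    return np.median(dist[iu])

rng = np.random.default_rng(1)
sigma = median_distance(rng.standard_normal((50, 3)))  # heuristic kernel width
```

As the text notes, this fixes the kernel width without reference to the importance estimation objective, which is exactly why an objective model selector (such as the cross-validation available for LSIF/uLSIF) is preferable.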
The computation of KMM is rather demanding since a quadratic programming problem has to be solved.

Other approaches to directly estimating the importance fit an importance model to the true importance—a method based on logistic regression (LogReg) [1], or a method based on the kernel model (6) (which is called the Kullback-Leibler importance estimation procedure, KLIEP) [9, 6]. Model selection of these methods is possible by cross-validation, which is a significant advantage over KMM. However, LogReg and KLIEP are computationally rather expensive since non-linear optimization problems have to be solved.

The proposed LSIF is qualitatively similar to LogReg and KLIEP, i.e., it avoids density estimation, model selection is possible, and non-linear optimization is involved. However, LSIF is advantageous over LogReg and KLIEP in that it is equipped with a regularization path tracking algorithm. Thanks to this, model selection of LSIF is computationally much more efficient than that of LogReg and KLIEP. However, the regularization path tracking algorithm tends to be numerically unstable.

The proposed uLSIF inherits good properties of the existing methods, such as avoiding density estimation and being equipped with a built-in model selection method. In addition to these preferable properties, the solution of uLSIF can be computed analytically through matrix inversion and therefore uLSIF is computationally very efficient and numerically stable. Furthermore, the closed-form solution of uLSIF allows us to compute the LOOCV score analytically without repeating hold-out loops, which highly contributes to reducing the computation time in the model selection phase.

5 Experiments

Importance Estimation: Let p_tr(x) be the d-dimensional normal distribution with mean zero and covariance identity; let p_te(x) be the d-dimensional normal distribution with mean (1, 0, ..., 0)^⊤ and covariance identity.
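This synthetic setup gives a compact end-to-end illustration of uLSIF: the sketch below draws such samples, builds the Gaussian kernel design of Eq. (6), and applies the closed-form solution (8). The hyperparameter values (σ, λ, b) are illustrative assumptions, not the paper's cross-validated choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tr, n_te, b, sigma, lam = 10, 100, 1000, 100, 1.0, 0.1

# p_tr = N(0, I), p_te = N((1, 0, ..., 0), I) -- the setup described above
x_tr = rng.standard_normal((n_tr, d))
x_te = rng.standard_normal((n_te, d))
x_te[:, 0] += 1.0

# Gaussian basis (6): centers are a random subset of the test points
centers = x_te[rng.choice(n_te, size=b, replace=False)]

def phi(x):
    # Kernel design matrix K_sigma(x_i, c_l) of Eq. (5)
    sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

Phi_tr, Phi_te = phi(x_tr), phi(x_te)
H_hat = Phi_tr.T @ Phi_tr / n_tr          # empirical H
h_hat = Phi_te.mean(axis=0)               # empirical h

# uLSIF solution, Eq. (8): solve the linear system, then clip negatives
beta = np.linalg.solve(H_hat + lam * np.eye(b), h_hat)
beta = np.maximum(0.0, beta)

w_hat = Phi_tr @ beta                     # importance at the training points
```

The single linear solve above is the whole fitting step, which is what makes the analytic LOOCV of Section 3 worthwhile: the same matrix factorization can be reused across all held-out samples.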
The task is to estimate the importance at the training input points: {w(x_i^tr)}_{i=1}^{n_tr}. We fixed the number of test input points at n_te = 1000 and consider the following two settings for the number n_tr of training samples and the input dimension d: (a) n_tr = 100 and d = 1, 2, ..., 20, (b) d = 10 and n_tr = 50, 60, ..., 150. We run the experiments 100 times for each d, each n_tr, and each method, and evaluate the quality of the importance estimates {ŵ_i}_{i=1}^{n_tr} by the normalized mean squared error (NMSE): (1/n_tr) Σ_{i=1}^{n_tr} (ŵ(x_i^tr) − w(x_i^tr))^2, where {ŵ(x_i^tr)}_{i=1}^{n_tr} and {w(x_i^tr)}_{i=1}^{n_tr} are each normalized to sum to one.

[Figure 1: NMSEs averaged over 100 trials in log scale. Figure 2: Mean computation time (after model selection) over 100 trials. Figure 3: Mean computation time (including model selection of σ and λ over a 9×9 grid). Each figure has panels (a) when d is changed and (b) when n_tr is changed, comparing KDE, KMM, LogReg, KLIEP, and uLSIF.]

NMSEs averaged over 100 trials (a) as a function of the input dimension d and (b) as a function of the training sample size n_tr are plotted in log scale in Figure 1. Error bars are omitted for clear visibility—instead, the best method in terms of the mean error and comparable ones based on the t-test at the significance level 1% are indicated by '◦'; the methods with significant difference are indicated by '×'. Figure 1(a) shows that the error of KDE sharply increases as the input dimension grows, while LogReg, KLIEP, and uLSIF tend to give much smaller errors than KDE. This would be the fruit of directly estimating the importance without going through density estimation. KMM tends to perform poorly, which is caused by an inappropriate choice of the Gaussian kernel width. This implies that the popular heuristic of using the median distance between samples as the Gaussian width is not always appropriate. On the other hand, model selection in LogReg, KLIEP, and uLSIF seems to work quite well. Figure 1(b) shows that the errors of all methods tend to decrease as the number of training samples grows. Again LogReg, KLIEP, and uLSIF tend to give much smaller errors than KDE and KMM.

Next we investigate the computation time. Each method has a different model selection strategy, i.e., KMM does not involve any cross-validation, KDE and KLIEP involve cross-validation over the kernel width, and LogReg and uLSIF involve cross-validation over both the kernel width and the regularization parameter.
Thus a naive comparison of the total computation time is not so meaningful. For this reason, we first investigate the computation time of each importance estimation method after the model parameters are fixed. The average CPU computation time over 100 trials is summarized in Figure 2. Figure 2(a) shows that the computation time of KDE, KLIEP, and uLSIF is almost independent of the input dimensionality d, while that of KMM and LogReg is rather dependent on d. Among them, the proposed uLSIF is one of the fastest methods. Figure 2(b) shows that the computation time of LogReg, KLIEP, and uLSIF is nearly independent of the training sample size n_tr, while that of KDE and KMM sharply increases as n_tr increases.

Both LogReg and uLSIF have very good accuracy and their computation time after model selection is comparable. Finally, we compare the entire computation time of LogReg and uLSIF including cross-validation, which is summarized in Figure 3. We note that the Gaussian width σ and the regularization parameter λ are chosen over the 9 × 9 equidistant grid in this experiment for both LogReg and uLSIF. Therefore, the comparison of the entire computation time is fair. Figures 3(a) and 3(b) show that uLSIF is approximately 5 to 10 times faster than LogReg.

Overall, uLSIF is shown to be comparable to the best existing method (LogReg) in terms of accuracy, but is computationally more efficient than LogReg.

Covariate Shift Adaptation in Regression and Classification: Next, we illustrate how the importance estimation methods can be used for covariate shift adaptation [8, 5, 1, 9]. Covariate shift is a situation in supervised learning where the input distributions change between the training and test phases but the conditional distribution of outputs given inputs remains unchanged.
Under covariate shift, standard learning techniques such as maximum likelihood estimation or cross-validation are biased; the bias caused by covariate shift can be asymptotically canceled by weighting the samples according to the importance. In addition to training input samples {x_i^tr}_{i=1}^{n_tr} following a training input density p_tr(x) and test input samples {x_j^te}_{j=1}^{n_te} following a test input density p_te(x), suppose that training output samples {y_i^tr}_{i=1}^{n_tr} at the training input points {x_i^tr}_{i=1}^{n_tr} are given. The task is to predict the outputs for the test inputs.

We use the kernel model

f̂(x; θ) = Σ_{ℓ=1}^{t} θ_ℓ K_h(x, m_ℓ)

for function learning, where K_h(x, x′) is the Gaussian kernel (5) and m_ℓ is a template point randomly chosen from {x_j^te}_{j=1}^{n_te}. We set the number of kernels at t = 50. We learn the parameter θ by importance weighted regularized least-squares (IWRLS):

min_θ [ Σ_{i=1}^{n_tr} ŵ(x_i^tr) (f̂(x_i^tr; θ) − y_i^tr)^2 + γ‖θ‖^2 ].    (9)

It is known that IWRLS is consistent when the true importance w(x_i^tr) is used as the weights—unweighted RLS is not consistent under covariate shift, given that the true learning target function f(x) is not realizable by the model f̂(x) [8].

The kernel width h and the regularization parameter γ in IWRLS (9) are chosen by importance weighted CV (IWCV) [9]. More specifically, we first divide the training samples {z_i^tr = (x_i^tr, y_i^tr)}_{i=1}^{n_tr} into R disjoint subsets {Z_r^tr}_{r=1}^{R}. Then a function f̂_r(x) is learned using {Z_j^tr}_{j≠r} by IWRLS and its mean test error for the remaining samples Z_r^tr is computed:

(1/|Z_r^tr|) Σ_{(x,y)∈Z_r^tr} ŵ(x) loss(f̂_r(x), y),    (10)

where loss(ŷ, y) is (ŷ − y)^2 in regression and (1/2)(1 − sign{ŷ y}) in classification.
We repeat this procedure for r = 1, ..., R and choose the kernel width h and the regularization parameter γ so that the average of the above mean test error over all r is minimized. We set the number of folds in IWCV at R = 5. IWCV is shown to be an (almost) unbiased estimator of the generalization error, while unweighted CV with misspecified models is biased due to covariate shift.

The datasets provided by DELVE and IDA are used for performance evaluation, where training input points are sampled with bias in the same way as [9]. We set the number of samples at n_tr = 100 and n_te = 500 for all datasets. We compare the performance of KDE, KMM, LogReg, KLIEP, and uLSIF, as well as the uniform weight (Uniform, i.e., no adaptation is made). The experiments are repeated 100 times for each dataset and we evaluate the mean test error: (1/n_te) Σ_{j=1}^{n_te} loss(f̂(x_j^te), y_j^te). The results are summarized in Table 1, where all the error values are normalized by that of the uniform weight (no adaptation). For each dataset, the best method and comparable ones based on the Wilcoxon signed rank test at the significance level 1% are described in bold face. The upper half corresponds to regression datasets taken from DELVE while the lower half corresponds to classification datasets taken from IDA.

The table shows that the generalization performance of uLSIF tends to be better than that of Uniform, KDE, KMM, and LogReg, while it is comparable to the best existing method (KLIEP). The mean computation time over 100 trials is described in the bottom row of the table, where the values are normalized so that the computation time of uLSIF is one. This shows that uLSIF is computationally more efficient than KLIEP.
Thus, the proposed uLSIF is overall shown to work well in covariate shift adaptation with low computational cost.

Table 1: Covariate shift adaptation. Mean and standard deviation of the test error over 100 trials (smaller is better). [Numerical entries omitted; rows: kin-8fh, kin-8fm, kin-8nh, kin-8nm, abalone, image, ringnorm, twonorm, waveform, Average; columns: Uniform, KDE, KMM, LogReg, KLIEP, uLSIF, plus a bottom row of computation times normalized so that uLSIF is one.]

Table 2: Outlier detection. Mean AUC values over 20 trials (larger is better). [Numerical entries omitted; rows: banana, b-cancer, diabetes, f-solar, german, heart, image, splice, thyroid, titanic, t-norm, w-form, Average; columns: uLSIF, KLIEP, LogReg, KMM, OSVM, LOF, KDE, plus a bottom row of computation times normalized so that uLSIF is one.]

Outlier Detection: Here, we consider an outlier detection problem of finding irregular samples in a dataset (the "evaluation dataset") based on another dataset (the "model dataset") that only contains regular samples. Defining the importance over the two sets of samples, we can see that the importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one. Thus, the importance values can be used as an index of the degree of outlyingness in this scenario. Since the evaluation dataset has wider support than the model dataset, we regard the evaluation dataset as the training set (i.e., the denominator of the importance) and the model dataset as the test set (i.e., the numerator of the importance). Then outliers tend to have small importance values (i.e., close to zero).

We again test KMM, LogReg, KLIEP, and uLSIF for importance estimation; in addition, we test native outlier detection methods such as the one-class support vector machine (OSVM) [7], the local outlier factor (LOF) [3], and the kernel density estimator (KDE). The datasets provided by IDA are used for performance evaluation. These are binary classification datasets consisting of training and test samples.
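The importance-based outlier scoring just described can be illustrated with a small least-squares density-ratio sketch in the spirit of uLSIF. This is a hedged reimplementation: the Gaussian-kernel linear model and the closed-form ridge solution follow the least-squares formulation, but the kernel width `h`, ridge parameter `lam`, and kernel centers are fixed by hand here, whereas the paper selects them with an analytic leave-one-out score; all names and the toy data are ours.

```python
import numpy as np

def ulsif_fit(x_de, x_nu, centers, h, lam):
    """Sketch of least-squares importance fitting.

    Models the ratio w(x) = p_nu(x) / p_de(x) as a non-negative linear
    combination of Gaussian kernels; x_de are samples from the denominator
    density, x_nu from the numerator density.
    """
    def phi(x):
        # (n, b) design matrix of Gaussian kernels around the centers
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * h ** 2))

    Phi_de, Phi_nu = phi(x_de), phi(x_nu)
    H = Phi_de.T @ Phi_de / len(x_de)   # second moment under the denominator
    hvec = Phi_nu.mean(axis=0)          # first moment under the numerator
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), hvec)
    alpha = np.maximum(alpha, 0.0)      # round negative coefficients to zero
    return lambda x: phi(x) @ alpha

# For outlier detection, the evaluation set is the denominator and the
# model set is the numerator, so outliers receive ratios close to zero.
rng = np.random.default_rng(0)
x_model = rng.normal(0.0, 1.0, size=(200, 1))             # regular samples only
x_eval = np.vstack([rng.normal(0.0, 1.0, size=(100, 1)), [[8.0]]])
w_hat = ulsif_fit(x_eval, x_model, centers=x_model[:50], h=1.0, lam=0.1)
scores = w_hat(x_eval)   # small score = high outlyingness
```

On this toy data, the point at 8.0 receives a near-zero ratio while inliers near the origin score well away from zero, so ranking by `scores` flags the outlier first.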
We allocate all positive training samples to the "model" set, while all positive test samples and 1% of the negative test samples are assigned to the "evaluation" set. Thus, we regard the positive samples as regular and the negative samples as irregular.

The mean AUC values over 20 trials as well as the computation time are summarized in Table 2, showing that uLSIF works fairly well. KLIEP works slightly better than uLSIF, but uLSIF is computationally much more efficient. LogReg overall works rather well, but it performs poorly on some datasets, which lowers its average AUC value. KMM and OSVM are not comparable to uLSIF in either AUC or computation time. LOF and KDE work reasonably well in terms of AUC, but their computational cost is high. Thus, the proposed uLSIF is overall shown to work well, and to be computationally efficient, also in outlier detection.

6 Conclusions

We proposed a new method for importance estimation that avoids solving the substantially more difficult task of density estimation. We are currently exploring various possible applications of importance estimation methods beyond covariate shift adaptation and outlier detection, e.g., feature selection, conditional distribution estimation, and independent component analysis; we believe that importance estimation could serve as a new versatile tool in machine learning.

References
[1] S. Bickel et al. Discriminative learning for differing training and test distributions. ICML 2007.
[2] S. Bickel et al. Dirichlet-enhanced spam filtering based on biased samples. NIPS 2006.
[3] M. M. Breunig et al. LOF: Identifying density-based local outliers. SIGMOD 2000.
[4] T. Hastie et al. The entire regularization path for the support vector machine. JMLR 2004.
[5] J. Huang et al. Correcting sample selection bias by unlabeled data. NIPS 2006.
[6] X. Nguyen et al. Estimating divergence functions and the likelihood ratio. NIPS 2007.
[7] B. Schölkopf et al. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.
[8] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[9] M. Sugiyama et al. Direct importance estimation with model selection. NIPS 2007.