{"title": "Inverse Density as an Inverse Problem: the Fredholm Equation Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1484, "page_last": 1492, "abstract": "We address the problem of estimating the ratio $\\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function.  Knowing or approximating this ratio is needed in various problems of inference and integration, in particular, when one needs to average a function with respect to one probability distribution, given a sample from another. It is often referred as {\\it importance sampling} in statistical inference and is  also closely related to the problem of {\\it covariate shift} in transfer learning as well as to various MCMC methods. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, and thus reducing it to an integral equation, known as the Fredholm problem of the first kind.   This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically.  The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible,  simple and  easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on $\\R^d$ and smooth $d$-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. Interestingly, it turns out that in the density ratio estimation setting, when samples from both distributions are available, there are simple completely unsupervised methods for choosing parameters. We  call this model selection mechanism CD-CV for Cross-Density Cross-Validation. Finally, we show encouraging experimental results including applications to classification  within the covariate shift framework.", "full_text": "Inverse Density as an Inverse Problem:\n\nthe Fredholm Equation Approach\n\nQichao Que, Mikhail Belkin\n\nDepartment of Computer Science and Engineering\n{que,mbelkin}@cse.ohio-state.edu\n\nThe Ohio State University\n\nAbstract\n\nWe address the problem of estimating the ratio q\np where p is a density function\nand q is another density, or, more generally an arbitrary function. Knowing or ap-\nproximating this ratio is needed in various problems of inference and integration\noften referred to as importance sampling in statistical inference. It is also closely\nrelated to the problem of covariate shift in transfer learning. Our approach is\nbased on reformulating the problem of estimating the ratio as an inverse problem\nin terms of an integral operator corresponding to a kernel, known as the Fredholm\nproblem of the \ufb01rst kind. This formulation, combined with the techniques of reg-\nularization leads to a principled framework for constructing algorithms and for\nanalyzing them theoretically. 
The resulting family of algorithms (FIRE, for Fred-\nholm Inverse Regularized Estimator) is \ufb02exible, simple and easy to implement.\nWe provide detailed theoretical analysis including concentration bounds and con-\nvergence rates for the Gaussian kernel for densities de\ufb01ned on Rd and smooth\nd-dimensional sub-manifolds of the Euclidean space.\nModel selection for unsupervised or semi-supervised inference is generally a dif\ufb01-\ncult problem. It turns out that in the density ratio estimation setting, when samples\nfrom both distributions are available, simple completely unsupervised model se-\nlection methods are available. We call this mechanism CD-CV for Cross-Density\nCross-Validation. We show encouraging experimental results including applica-\ntions to classi\ufb01cation within the covariate shift framework.\n\nIntroduction\n\n1\nIn this paper we address the problem of estimating the ratio of two functions, q(x)\np(x) where p is given\nby a sample and q(x) is either a known function or another probability density function given by a\nsample. This estimation problem arises naturally when one attempts to integrate a function with re-\nspect to one density, given its values on a sample obtained from another distribution. Recently there\nhave been a signi\ufb01cant amount of work on estimating the density ratio (also known as the impor-\ntance function) from sampled data, e.g., [6, 10, 9, 22, 2]. Many of these papers consider this problem\nin the context of covariate shift assumption [19] or the so-called selection bias [27]. The approach\ntaken in our paper is based on reformulating the density ratio estimation as an integral equation,\nknown as the Fredholm equation of the \ufb01rst kind, and solving it using the tools of regularization\nand Reproducing Kernel Hilbert Spaces. That allows us to develop simple and \ufb02exible algorithms\nfor density ratio estimation within the popular kernel learning framework. The connection to the\nclassical operator theory setting makes it easier to apply the standard tools of spectral and Fourier\nanalysis to obtain theoretical results.\nWe start with the following simple equality underlying the importance sampling method:\n\nZ\n\nZ\n\n(cid:18)\n\nh(x) q(x)\np(x)\n\n(cid:19)\n\n(1)\n\nEq(h(x)) =\n\nh(x)q(x)dx =\n\nh(x) q(x)\n\np(x) p(x)dx = Ep\n\n1\n\n\fZ\n\nZ\n\nBy replacing the function h(x) with a kernel k(x, y), we obtain\n\nKp\n\nq\np\n\n(x) :=\n\nk(x, y) q(y)\n\np(y) p(y)dy =\n\nk(x, y)q(y)dy := Kq1(x).\n\n(2)\n\nThinking of the function q(x)\np(x) as an unknown quantity and assuming that the right hand side is known\nthis becomes a Fredholm integral equation. Note that the right-hand side can be estimated given a\nsample from q while the operator on the left can be estimated using a sample from p.\nTo push this idea further, suppose kt(x, y) is a \u201clocal\u201d kernel, (e.g., the Gaussian, kt(x, y) =\n(2\u03c0t)d/2 e\u2212 kx\u2212yk2\nRd kt(x, y)dx = 1. When we use \u03b4-kernels, like Gaussian, and f\nRd kt(x, y)f(x)dx = f(y) + O(t) (see [24], Ch.\n\nsatis\ufb01es some smoothness conditions, we haveR\n\n) such thatR\n\n2t\n\n1\n\n1). 
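This smoothing property is straightforward to verify numerically. The following minimal Python sketch (a one-dimensional illustration of our own, with the bandwidths and the test function chosen purely for convenience) approximates the convolution by quadrature on a grid and exhibits the O(t) error:

import numpy as np

def gaussian_kernel(x, y, t):
    # one-dimensional Gaussian delta-family kernel, normalized to integrate to 1 in x
    return np.exp(-(x - y) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

f = lambda x: np.sin(x)                # a smooth test function
y = 0.7                                # point at which the convolution is evaluated
grid = np.linspace(-10.0, 10.0, 20001) # quadrature grid for the dx integral

for t in [0.1, 0.01, 0.001]:
    conv = np.trapz(gaussian_kernel(grid, y, t) * f(grid), grid)
    print(t, abs(conv - f(y)))         # the error shrinks roughly linearly in t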
Thus we get another (approximate) integral equality:\nkt(x, y) q(x)\n\n(y) :=\n\nKt,p\n\nq\np\n\np(x) p(x)dx \u2248 q(y).\n\nZ\n\nRd\n\n(3)\n\nIt becomes an integral equation for q(x)\n\np(x), assuming that q is known or can be approximated.\n\nn\n\nL2,p\n\nL2,p\n\n\u2248 1\n\nn\n\nL2,p\n\n\u2248 arg min\n\n\u2248 arg min\n\nf\u2208H kKpf\u2212Kq1(x)k2\n\nWe address these inverse problems by formulating them within the classical framework of Tiknonov-\nPhilips regularization with the penalty term corresponding to the norm of the function in the Repro-\nducing Kernel Hilbert Space H with kernel kH used in many machine learning algorithms.\n+\u03bbkfk2H\n[Type I]: q\np\nImportantly, given a sample x1, . . . , xn from p, the integral operator Kpf applied to a function f\ncan be approximated by the corresponding discrete sum Kpf(x) \u2248 1\ni f(xi)K(xi, x), while\nL2,p norm is approximated by an average: kfk2\ni f(xi)2. Of course, the same holds for\na sample from q. We see that the Type I formulation is useful when q is a density and samples from\nboth p and q are available, while the Type II is useful, when the values of q (which does not have to\nbe a density function at all1) are known at the data points sampled from p.\nSince all of these involve only function evaluations at the sample points, an application of the usual\nrepresenter theorem for Reproducing Kernel Hilbert Spaces, leads to simple, explicit and easily\nimplementable algorithms, representing the solution of the optimization problem as linear combi-\ni \u03b1ikH(xi, x) (see Section 2). We call the\n\nnations of the kernels over the points of the sampleP\n\n+\u03bbkfk2H [II]: q\np\nP\n\nf\u2208H kKt,pf\u2212qk2\nP\n\nresulting algorithms FIRE for Fredholm Inverse Regularized Estimator.\nRemark: Other norms and loss functions. Norms and loss functions other that L2,p can also be\nused in our setting as long as they can be approximated from a sample using function evaluations.\n1. Perhaps, the most interesting is L2,q norm available in the Type I setting, when a sample from\nthe probability distribution q is available. In fact, given a sample from both p and q we can use the\ncombined empirical norm \u03b3k \u00b7 kL2,p + (1 \u2212 \u03b3)k \u00b7 kL2,q. Optimization using those norms leads to\nsome interesting kernel algorithms described in Section 2. We note that the solution is still a linear\ncombination of kernel functions centered on the sample from p and can still be written explicitly.\n2. In Type I formulation, if the kernels k(x, y) and kH(x, y) coincide, it is possible to use the RKHS\nnorm k \u00b7 kH instead of L2,p. This formulation (see Section 2) also yields an explicit formula and is\nrelated to the Kernel Mean Matching [9] , although with a different optimization procedure.\nSince we are dealing with a classical inverse problem for integral operators, our formulation allows\nfor theoretical analysis using the methods of spectral theory. 
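To make the sample-based approximations of Kpf and of the L2,p norm concrete, the following minimal NumPy sketch (our own illustration; Xp denotes an i.i.d. sample from p, and the bandwidth t and the dimension are arbitrary) implements the plug-in estimates used to discretize the Type I and Type II problems:

import numpy as np

def gaussian_gram(X, Y, t):
    # matrix of k_t(x_i, y_j) for the normalized Gaussian kernel
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * t)) / (2 * np.pi * t) ** (X.shape[1] / 2)

def empirical_Kp(f_vals, Xp, Xeval, t):
    # (K_p f)(x) ~ (1/n) sum_i k_t(x_i, x) f(x_i), with x_i drawn from p
    return gaussian_gram(Xeval, Xp, t) @ f_vals / len(Xp)

def empirical_L2p_norm_sq(f_vals):
    # ||f||_{2,p}^2 ~ (1/n) sum_i f(x_i)^2
    return np.mean(f_vals ** 2)

# toy check: with x_i ~ N(0, I_2) and f(x) = ||x||^2, E_p f(x)^2 = 8
Xp = np.random.randn(2000, 2)
print(empirical_L2p_norm_sq((Xp ** 2).sum(1)))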
In Section 3 we present concentration\nand error bounds as well as convergence rates for our algorithms when data are sampled from a\ndistribution de\ufb01ned in Rd, a domain in Rd with boundary or a compact d-dimensional sub-manifold\nof a Euclidean space RN for the case of the Gaussian kernel.\nIn Section 4 we introduce a unsupervised method, referred as CD-CV (for cross-density cross-\nvalidation) for model selection and discuss the experimental results on several data sets comparing\nour method FIRE with the available alternatives, Kernel Mean Matching (KMM) [9] and LSIF [10]\nas well as the base-line thresholded inverse kernel density estimator2 (TIKDE) and importance sam-\npling (when available).\n\n1This could be useful in sampling procedures, when the normalizing coef\ufb01cients are hard to estimate.\n2The standard kernel density estimator for q divided by a thresholded kernel density estimator for p.\n\n2\n\n\fWe summarize the contributions of the paper as follows:\n1. We provide a formulation of estimating the density ratio (importance function) as a classical\ninverse problem, known as the Fredholm equation, establishing a connections to the methods of\nclassical analysis. The underlying idea is to \u201clinearize\u201d the properties of the density by studying an\nassociated integral operator.\n2. To solve the resulting inverse problems we apply regularization with an RKHS norm penalty. This\nprovides a \ufb02exible and principled framework, with a variety of different norms and regularization\ntechniques available. It separates the underlying inverse problem from the necessary regularization\nand leads to a family of very simple and direct algorithms within the kernel learning framework in\nmachine learning.\n3. Using the techniques of spectral analysis and concentration, we provide a detailed theoretical\nanalysis for the case of the Gaussian kernel, for Euclidean case as well as for distributions supported\non a sub-manifold. We prove error bounds and as well as the convergence rates.\n4. We also propose a completely unsupervised technique, CD-CV, for cross-validating the parame-\nters of our algorithm and demonstrate its usefulness, thus addressing in our setting one of the most\nthorny issues in unsupervised/semi-supervised learning. We evaluate and compare our methods on\nseveral different data sets and in various settings and demonstrate strong performance and better\ncomputational ef\ufb01ciency compared to the alternatives.\nRelated work. Recently the problem of density ratio estimation has received signi\ufb01cant attention\ndue in part to the increased interest in transfer learning [15] and, in particular to the form of transfer\nlearning known as covariate shift [19]. To give a brief summary, given the feature space X and the\nlabel space Y , two probability distributions p and q on X \u00d7 Y satisfy the covariate assumption if\nfor all x, y, p(y|x) = q(y|x). It is easy to see that training a classi\ufb01er to minimize the error for q,\ngiven a sample from p requires estimating the ratio of the marginal distributions qX (x)\npX (x). The work on\ncovariate shift, density ratio estimation and related settings includes [27, 2, 6, 10, 22, 9, 23, 14, 7].\nThe algorithm most closely related to ours is Kernel Mean Matching [9]. It is based on the equation:\np\u03a6(x)), where \u03a6 is the feature map corresponding to an RKHS H. 
It is rewritten\nEq(\u03a6(x)) = Ep( q\np(x) \u2248 arg min\u03b2\u2208L2,\u03b2(x)>0,Ep(\u03b2)=1 kEq(\u03a6(x)) \u2212 Ep(\u03b2(x)\u03a6(x))kH.\nas an optimization problem q(x)\nThe quantity on the right can be estimated given a sample from p and a sample from q and the\nminimization becomes a quadratic optimization problem over the values of \u03b2 at the points sampled\nfrom p. Writing down the feature map explicitly, i.e., recalling that \u03a6(x) = KH(x,\u00b7), we see that\nthe equality Eq(\u03a6(x)) = Ep( q\np\u03a6(x)) is equivalent to the integral equation Eq. 2 considered as an\nidentity in the Hilbert space H. Thus the problem of KMM can be viewed within our setting Type I\n(see the Remark 2 in the introduction), with a RKHS norm but a different optimization algorithm.\nHowever, while the KMM optimization problem uses the RKHS norm, the weight function \u03b2 itself\nis not in the RKHS. Thus, unlike most other algorithms in the RKHS framework (in particular,\nFIRE), the empirical optimization problem does not have a natural out-of-sample extension. Also,\nsince there is no regularizing term, the problem is less stable (see Section 4 for some experimental\ncomparisons) and the theoretical analysis is harder (however, see [6] and the recent paper [26] for\nsome nice theoretical analysis of KMM in certain settings).\nAnother related recent algorithm is Least Squares Importance Sampling (LSIF) [10], which attempts\nto estimate the density ratio by choosing a parametric linear family of functions and choosing a\nfunction from this family to minimize the L2,p distance to the density ratio. A similar setting with\nthe Kullback-Leibler distance (KLIEP) was proposed in [23]. This has an advantage of a natural\nout-of-sample extension property. We note that our method for unsupervised parameter selection in\nSection 4 is related to their ideas. However, in our case the set of test functions does not need to\nform a good basis since no approximation is required.\nWe note that our methods are closely related to a large body of work on kernel methods in machine\nlearning and statistical estimation (e.g., [21, 17, 16]). Many of these algorithms can be interpreted\nas inverse problems, e.g., [3, 20] in the Tikhonov regularization or other regularization frameworks.\nIn particular, we note interesting methods for density estimation proposed in [12] and estimating the\nsupport of density through spectral regularization in [4], as well as robust density estimation using\nRKHS formulations [11] and conditional density [8]. We also note the connections of the methods\nin this paper to properties of density-dependent operators in classi\ufb01cation and clustering [25, 18, 1].\nAmong those works that provide theoretical analysis of algorithms for estimating density ratios,\n\n3\n\n\fP\n\n[14] establishes minimax rates for likelihood ratio estimation. Another recent theoretical analysis of\nKMM in [26] contains bounds for the output of the corresponding integral operators.\n2 Settings and Algorithms\nSettings and objects. We start by introducing objects and function spaces important for our de-\nvelopment. As usual, the norm in space of square-integrable functions with respect to a measure\n\n\u2126 |f(x)|2d\u03c1 < \u221e(cid:9) . This is a Hilbert space with the inner\n\n\u03c1, is de\ufb01ned as follows: L2,\u03c1 = (cid:8)f :R\nproduct de\ufb01ned in the usual way by hf, gi2,\u03c1 =R\nthe operator K\u03c1: K\u03c1f(y) :=R\n\n\u2126 f(x)g(x)d\u03c1. 
Given a kernel k(x, y) we de\ufb01ne\n\u2126 k(x, y)f(x)d\u03c1(x). We will use the notation Kt,\u03c1 to explicitly refer\nto the parameter of the kernel function kt(x, y), when it is a \u03b4-family. If the function k(x, y) is\nsymmetric and positive de\ufb01nite, then there is a corresponding Reproducing Kernel Hilbert space\n(RKHS) H. We recall the key property of the kernel kH: for any f \u2208 H, hf, kH(x,\u00b7)iH = f(x).\nThe Representer Theorem allows us to write solutions to various optimization problems over H in\nP\nterms of linear combinations of kernels supported on sample points (see [21] for an in-depth discus-\nsion or the RKHS theory and the issues related to learning). Given a sample x1, . . . , xn from p, one\ncan approximate the L2,p norm of a suf\ufb01ciently smooth functionf by kfk2\ni |f(xi)|2, and\nsimilarly, the integral operator Kpf(x) \u2248 1\ni k(xi, x)f(xi). These approximate equalities can\nbe made precise by using appropriate concentration inequalities.\nThe FIRE Algorithms. As discussed in the introduction, the starting point for our development is\nthe two integral equalities,\np(y) dp(y) = q(\u00b7) + o(1)\n[I]: Kp\n\np(y) dp(y) = Kq1(\u00b7) [II]:Kt,p\n\nkt(\u00b7, y) q(y)\n\nk(\u00b7, y) q(y)\n\nq , known as Fredholm equations of the \ufb01rst kind. To estimate p\n\n(4)\nNotice that in the Type I setting, the kernel does not have to be in a \u03b4-family. For example, a linear\nkernel is admissible. Type II setting comes from the fact Kt,qf(x) \u2248 f(x)p(x) + O(t) for a \u201c\u03b4-\nfunction-like\u201d kernel and we keep t in the notation in that case. Assuming that either Kq1 or q are\n(approximately) known (Type I and II settings, respectively) equalities in Eqs. 4 become integral\nequations for p\nq , we need to obtain\nan approximation to the solution which (a) can be obtained computationally from sampled data, (b)\nis stable with respect to sampling and other perturbation of the input function, (c) can be analyzed\nusing the standard machinery of functional analysis.\nTo provide a framework for solving these inverse problems, we apply the classical techniques of\nregularization combined with the RKHS norm popular in machine learning. In particular a simple\nformulation of Type I using Tikhonov regularization, ([5], Ch. 5), with the L2,p norm is as follows:\n(5)\n\n[Type I]:\n\n2,p \u2248 1\n\n(\u00b7) =\n\n(\u00b7) =\n\n2,p + \u03bbkfk2H\n\nf I\n\u03bb = arg min\n\nf\u2208HkKpf \u2212 Kq1k2\n\nHere H is an appropriate Reproducing Kernel Hilbert Space. Similarly Type II can be solved by\n\nZ\n\nZ\n\nq\np\n\nq\np\n\nn\n\nn\n\n[Type II]:\n\nf II\n\u03bb = arg min\n\nf\u2208H kKt,pf \u2212 qk2\n\n2,p + \u03bbkfk2H\n\n(6)\n\nWe will now discuss the empirical versions of these equations and the resulting algorithms.\nType I setting. Algorithm for L2,p norm. Given an iid sample from p, zp = {xi}n\nan iid sample from q, zq = {x0\nP\nintegral operators Kp and Kq by Kzpf(x) = 1\n\ni=1 and\nj}m\nj=1 (z for the combined sample), we can approximate the\nand Kzq f(x) =\nX\n\ni). Thus the empirical version of Eq. 5 becomes\n\nk(xi, x)f(xi)\n\nP\n\ni\u2208zq\nx0\n\nk(x0\n\nxi\u2208zp\n\n((Kzpf)(xi) \u2212 (Kzq1)(xi))2 + \u03bbkfk2H\n\ni, x)f(x0\nf I\n\u03bb,z = arg min\nf\u2208H\n\n(7)\n\n1\nm\n\nn\n\n1\nn\n\nxi\u2208zp\n\nThe \ufb01rst term of the optimization problem involves only evaluations of the function f at the points\nof the sample. 
From Representer Theorem and matrix manipulation, we obtain the following:\n\n\u03bb,z(x) = X\n\nf I\n\nkH(xi, x)vi and v =(cid:0)K 2\n\np,pKH + n\u03bbI(cid:1)\u22121\n\nKp,pKp,q1zq .\n\n(8)\n\nxi\u2208zp\n\nwhere the kernel matrices are de\ufb01ned as follows: (Kp,p)ij = 1\nxi, xj \u2208 zp and Kp,q is de\ufb01ned as (Kp,q)ij = 1\n\nm k(xi, x0\n\nj) for xi \u2208 zp and x0\n\nj \u2208 zq.\n\nn k(xi, xj), (KH)ij = kH(xi, xj) for\n\n4\n\n\f(cid:0)K 3\np,p + \u03bbI(cid:1)\u22121\n\nX\n\nKp,pKp,q1zq.\n\n(9)\nm} we\n\n2, . . . , x0\n\nf *\n\u03bb = arg min\n\narg min\nf\u2208H\n\n\u03b3\nn\n\n\u03bb,z(x) =\n1 \u2212 \u03b3\nm\n\n2,p + (1 \u2212 \u03b3)kKpf \u2212 Kq1k2\n\nX\n(cid:18) \u03b3\n\nxi\u2208zp\n\nIf KH and Kp,p are the same kernel we simply have: v = 1\nn\nAlgorithms for \u03b3L2,p +(1\u2212 \u03b3)L2,q norm. Depending on the setting, we may want to minimize the\nerror of the estimate over the probability distribution p, q or over some linear combination of these.\nA signi\ufb01cant potential bene\ufb01t of using a linear combination is that both samples can be used at the\nsame time in the loss function. First we state the continuous version of the problem:\n2,q + \u03bbkfk2H\n1, x0\n\nf\u2208H \u03b3kKpf \u2212 Kq1k2\nGiven a sample from p, zp = {x1, x2, . . . , xn} and a sample from q, zq = {x0\ni )(cid:1)2 +\n(cid:0)Kzpf(xi) \u2212 Kzq1(xp\nobtain an empirical version of the Eq. 9: f\u2217\n\u03bb,z(x) =P\n(cid:19)\n\ni)(cid:1)2 + \u03bbkfk2\n(cid:19)\nv = (K + n\u03bbI)\u22121 K11zq\nm k(xi, x0\nj)\nj \u2208 zq. Despite the loss function combining both samples,\n\nn k(xi, xj), (KH)ij = kH(xi, xj) for xi, xj \u2208 zp, and (Kp,q)ij = 1\nj, xi) for xi \u2208 zp,x0\n\nwhere (Kp,p)ij = 1\nn k(x0\nand (Kq,p)ji = 1\nthe solution is still a summation of kernels over the points in the sample from p.\nAlgorithms for the RKHS norm. In addition to using the RKHS norm for regularization norm, we\n\u03bb = arg minf\u2208H kKpf \u2212 Kq1k2H0 + \u03bbkfk2H Here the Hilbert\ncan also use it as a loss function: f *\nspace H0 must correspond to the kernel k and can potentially be different from the space H used for\nregularization. Note that this formulation is only applicable in the Type I setting since it requires\nthe function q to belong to the RKHS H0. Given two samples zp, zq, it is easy to write down the\nempirical version of this problem, leading to the following formula:\n\n(cid:0)(Kzpf)(x0\n(cid:18) \u03b3\n\nFrom the Representer Theorem f\u2217\n\ni\u2208zq\nx0\nvikH(xi, x)\n\nK T\n\nq,pKq,p\n\nKH and K1 =\n\ni) \u2212 (Kzq1)(x0\n\nK =\n\n(Kp,p)2 +\n\nKp,pKp,q +\n\nn\n\n1 \u2212 \u03b3\nm\n\n1 \u2212 \u03b3\nm\n\nK T\n\nq,pKq,q\n\nxi\u2208zp\n\nn\n\nH\n\nvikH(xi, x)\n\nv = (Kp,pKH + n\u03bbI)\u22121 Kp,q1zq .\n\n(10)\n\n\u03bb,z(x) = X\n\nf\u2217\n\nxi\u2208zp\n\nThe result is somewhat similar to our Type I formulation with the L2,p norm. We note the connection\nbetween this formulation of using the RKHS norm as a loss function and the KMM algorithm [9].\nWhen the kernels K and KH are the same, Eq. 10 can be viewed as a regularized version of KMM\n(with a different optimization procedure).\nType II setting. In Type II setting we assume that we have a sample z = {xi}n\ni=1 drawn from p\nand that we know the function values q(xi) at the points of the sample. 
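Before completing the Type II derivation, we note that the Type I closed form in Eq. 8 translates directly into a few lines of linear algebra. The sketch below is a minimal NumPy illustration (assuming Gaussian kernels for both k and kH, with bandwidths t and tH and variable names of our own choosing), not a tuned implementation:

import numpy as np

def gaussian_gram(X, Y, t):
    # matrix of k_t(x_i, y_j) for the normalized Gaussian kernel
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * t)) / (2 * np.pi * t) ** (X.shape[1] / 2)

def fire_type1_fit(Xp, Xq, t, tH, lam):
    # Type I FIRE with the L2,p loss (Eq. 8): returns v such that the
    # estimated ratio is f(x) = sum_i k_H(x_i, x) v_i over the sample x_i from p
    n, m = len(Xp), len(Xq)
    Kpp = gaussian_gram(Xp, Xp, t) / n     # (K_{p,p})_{ij} = k(x_i, x_j) / n
    Kpq = gaussian_gram(Xp, Xq, t) / m     # (K_{p,q})_{ij} = k(x_i, x'_j) / m
    KH = gaussian_gram(Xp, Xp, tH)         # (K_H)_{ij} = k_H(x_i, x_j)
    A = Kpp @ Kpp @ KH + n * lam * np.eye(n)
    v = np.linalg.solve(A, Kpp @ (Kpq @ np.ones(m)))
    return v

def fire_type1_predict(v, Xp, Xnew, tH):
    # out-of-sample evaluation of the estimated ratio q/p at new points
    return gaussian_gram(Xnew, Xp, tH) @ v

The Type II solution derived next has exactly the same structure, with the vector Kp,q1zq (the sample estimate of Kq1 at the points of zp) replaced by the known values q(xi). We now return to the Type II setting.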
Replacing the norm and the\nintegral operator with their empirical versions, we obtain the following optimization problem:\n\nAs before, using the Representer Theorem we obtain an analytical formula for the solution:\n\nf II\n\u03bb,z = arg min\nf\u2208H\n\n\u03bb,z(x) = X\n\nf II\n\nxi\u2208z\n\n(Kt,zpf(xi) \u2212 q(xi))2 + \u03bbkfk2H\n\nkH(xi, x)vi where v =(cid:0)K 2KH + n\u03bbI(cid:1)\u22121\n\nKq.\n\n(11)\n\nX\n\nxi\u2208z\n\n1\nn\n\nn kt(xi, xj), (KH)ij = kH(xi, xj) and qi = q(xi).\n\nwhere the kernel matrix K is de\ufb01ned by Kij = 1\nComparison of type I and type II settings.\n1. In Type II setting q does not have to be a density function (i.e., non-negative and integrate to one).\n2. Eq. 7 of the Type I setting cannot be easily solved in the absence of a sample zq from q, since esti-\nmating Kq requires either sampling from q (if it is a density) or estimating the integral in some other\nway, which may be dif\ufb01cult in high dimension but perhaps of interest in certain low-dimensional\napplication domains.\n3. There are a number of problems (e.g., many problems involving MCMC) where q(x) is known\nexplicitly (possibly up to a multiplicative constant), while sampling from q is expensive or even\nimpossible computationally [13].\n4. Unlike Eq. 5, Eq. 6 has an error term depending on the kernel. For example, in the important case\nof the Gaussian kernel, the error is of the order O(t), where t is the variance of Gaussian.\n5. Several norms are available in the Type I setting, but only the L2,p norm is available for Type II.\n\n5\n\n\f3 Theoretical analysis: bounds and convergence rates for Gaussian Kernels\nIn this section, we state our main results on bounds and convergence rates for our algorithm based\non Tikhonov regularization with a Gaussian kernel. We consider both Type I and Type II settings\nfor the Euclidean and manifold cases and make a remark on the Euclidean domains with boundary.\nTo simplify the theoretical development, the integral operator and the RKHS H will correspond to\nthe same Gaussian kernel kt(x, y). The proofs will be found in the supplemental material.\nAssumptions: The set \u2126, where the density function p is de\ufb01ned, could be one of the following: (1)\nthe whole Rd; (2) a compact smooth Riemannian sub-manifold M of d-dimension in Rn. We also\nneed p(x) < \u0393, q(x) < \u0393 for any x \u2208 \u2126 and that q\nTheorem 1. ( Type I setting.) Let p and q be two density functions on \u2126. Given n points, zp =\nm}, i.i.d. sampled from\n{x1, x2, . . . , xn}, i.i.d. sampled from p and m points, zq = {x0\nq, and for small enough t, for the solution to the optimization problem in (7), with con\ufb01dence at\nleast 1 \u2212 2e\u2212\u03c4 , we have\n(cid:19)\n(1) If the domain \u2126 is Rd, for some constants C1, C2, C3 independent of t, \u03bb.\n1\n\u221a\n\u03bb1/6\n\np2 are in Sobolev space W 2\n\n\u2264 C1t + C2\u03bb\n\n\u03bb,z \u2212 q\np\n\n2, . . . 
, x0\n\n\u03c4\n\u03bbtd/2\n\n2 + C3\n\n2 (\u2126).\n\n1, x0\n\np , q\n\n(12)\n\n\u221a\n\n+\n\nn\n\n1\n\n(2) If the domain \u2126 is a compact sub-manifold without boundary of d dimension, for some 0 < \u03b5 < 1,\nC1, C2, C3 independent of t, \u03bb.\n\n(cid:13)(cid:13)(cid:13)(cid:13)f I\n(cid:13)(cid:13)(cid:13)(cid:13)f I\n(cid:13)(cid:13)(cid:13)(cid:13)2\n\n2,p\n\n(cid:13)(cid:13)(cid:13)(cid:13)2,p\n(cid:13)(cid:13)(cid:13)(cid:13)2,p\n(cid:16)\u221a\n\n\u03c4 n\n\nm\n\n(cid:18) 1\u221a\n(cid:18) 1\u221a\n(cid:16)\u221a\n\nm\n\n= O\n\n\u221a\n\n\u03c4\n\u03bbtd/2\n\n(cid:13)(cid:13)(cid:13)(cid:13)2\n\n2,p\n\n(cid:19)\n\n+\n\n1\n\u221a\n\u03bb1/6\n\nn\n\nCorollary 2. ( Type I setting.) Assuming m > \u03bb1/3n, with con\ufb01dence at least 1 \u2212 2e\u2212\u03c4 , when\n(1) \u2126 = Rd, (2) \u2126 is a d-dimensional sub-manifold of a Euclidean space, we have\n\n= O\n\n\u2212 1\n\n3.5+d/2\n\n(2)\n\n\u2212\n\n\u03c4 n\n\n1\n\n3.5(1\u2212\u03b5)+d/2\n\n(13)\n\n(cid:17)\u2200\u03b5 \u2208 (0, 1)\n\n\u03bb,z \u2212 q\np\n\n\u2264 C1t1\u2212\u03b5 + C2\u03bb\n\n1\n\n2 + C3\n\n(cid:13)(cid:13)(cid:13)(cid:13)f I\n\n(1)\n\n\u03bb,z \u2212 q\np\n\n(cid:17)\n\n(cid:13)(cid:13)(cid:13)(cid:13)f I\n\n\u03bb,z \u2212 q\np\n\n(cid:13)(cid:13)(cid:13)(cid:13)2,p\n(cid:13)(cid:13)(cid:13)(cid:13)2,p\n\nTheorem 3. ( Type II setting.) Let p be a density function on \u2126 and q be a function satisfying the\nassumptions. Given n points z = {x1, x2, . . . , xn} sampled i.i.d. from p, and for suf\ufb01ciently small\nt, for the solution to the optimization problem in (11), with con\ufb01dence at least 1 \u2212 2e\u2212\u03c4 , we have\n(1) If the domain \u2126 is Rd,\n\n\u221a\n\n(cid:13)(cid:13)(cid:13)(cid:13)f II\n(cid:13)(cid:13)(cid:13)(cid:13)f II\n(cid:13)(cid:13)(cid:13)(cid:13)f II\n\n\u03bb,z \u2212 q\np\n\n\u03bb,z \u2212 q\np\n\n\u2264C1t + C2\u03bb\n\n1\n\n2 + C3\u03bb\u2212 1\n\n3 kKt,q1 \u2212 qk2,p + C4\n\n\u03c4\n\u03bb3/2td/2\n\n\u221a\n\n,\n\nn\n\n(14)\n\nwhere C1, C2, C3, C4 are constants independent of t, \u03bb. Moreover, kKt,q1 \u2212 qk2,p = O(t).\n(2) If \u2126 is a d-dimensional sub-manifold of a Euclidean space, for any 0 < \u03b5 < 1\n\u221a\n\n\u03bb,z \u2212 q\np\n\n\u2264C1t1\u2212\u03b5 + C2\u03bb\n\n1\n\n2 + C3\u03bb\u2212 1\n\n3 kKt,q1 \u2212 qk2,p + C4\n\n\u03c4\n\u03bb3/2td/2\n\n\u221a\n\n,\n\nn\n\n(15)\n\nwhere C1, C2, C3, C4 are independent of t, \u03bb. Moreover, kKt,q1 \u2212 qk2,p = O(t1\u2212\u03b7),\u2200\u03b7 > 0.\nCorollary 4. ( Type II setting.) With con\ufb01dence at least 1 \u2212 2e\u2212\u03c4 , when\n(1) \u2126 = Rd, (2) \u2126 is a d-dimensional sub-manifold of a Euclidean space, we have\n\u2212 1\u2212\u03b7\n4\u22124\u03b7+ 5\n\n(cid:18)\u221a\n\n(cid:18)\u221a\n\n\u2212 1\n4+ 5\n\n(cid:19)\n\n(cid:19)\n\n= O\n\n\u03c4 n\n\n6 d\n\n(2)\n\n= O\n\n\u03c4 n\n\n(1)\n\n6 d\n\n\u2200\u03b7 \u2208 (0, 1)\n\n(cid:13)(cid:13)(cid:13)(cid:13)f II\n\n\u03bb,z \u2212 q\np\n\n(cid:13)(cid:13)(cid:13)(cid:13)2\n\n2,p\n\n(cid:13)(cid:13)(cid:13)(cid:13)2\n\n2,p\n\n4 Model Selection and Experiments\nWe describe an unsupervised technique for parameter selection, Cross-Density Cross-Validation\n(CD-CV) based on a performance measure unique to our setting. We proceed to evaluate our method.\nThe setting. In our experiments, we have X p = {xp\nm}. The\ngoal is to estimate q\np, assuming that X p, X q are i.i.d. sampled from p, q respectively. Note that\n\nn} and X q = {xq\n\n1, . . . , xp\n\n1, . . . 
, xq\n\n6\n\n\fp is unsupervised and our algorithms typically have two parameters: the kernel width t and\n\nlearning q\nregularization parameter \u03bb.\n(cid:16)\nPerformance Measures and CD-CV Model Selection. We describe a set of performance measures\nused for parameter selection. For a given function u, we have the following importance sampling\nequality (Eq. 1): Ep(u(x)) = Eq\np, and\nX p, X q are samples from p, q respectively, we will have the following approximation to the previous\ni ) \u2248 1\nj). So after obtaining an estimate f of the ratio, we\nequation: 1\nn\ncan validate it using the following performance measure:\n\n. If f is an approximation of the true ratio q\n\nPm\n\nPn\n\nu(x) p(x)\nq(x)\n\nj=1 u(xq\n\nm\n\ni=1 u(xp\n\ni )f(xp\n\n(cid:17)\n\uf8eb\uf8ed nX\n\nFX\n\ni ) \u2212 mX\n\n\uf8f6\uf8f82\n\nJCD(f; X p, X q, U) =\n\n1\nF\n\nul(xp\n\ni )f(xp\n\nul(xq\nj)\n\n(16)\n\nl=1\n\ni=1\n\nj=1\n\ncv and X q\n\ntrain and X q\n\ntrain for training and X p\n\nwhere U = {u1, . . . , uF} is a collection of test functions. Using this performance measure allows\nvarious cross-validation procedures to be used for parameter selection. We note that this way to\nmeasure error is related to the LSIF [10] and KLIEP [23] algorithms. However, there a similar\nmeasure is used to construct an approximation to the ratio q\np using functions u1, . . . , uF as a basis.\nIn our setting, we can use test functions (e.g., linear functions) which are poorly suited as a basis for\napproximating the density ratio.\nWe will use the following two families of test functions for parameter selection: (1) Sets of ran-\ndom linear functions ui(x) = \u03b2T x where \u03b2 \u223c N(0, Id); (2) Sets of random half-space indicator\nfunctions, ui(x) = 1\u03b2T x>0.\nProcedures for parameter selection. The performance is optimized using \ufb01ve-fold cross-validation\nby splitting the data set into two parts X p\ncv for validation.\nThe range we use for kernel width t is (t0, 2t0, . . . , 29t0), where t0 is the average distance of the 10\nnearest neighbors. The range for regularization parameter \u03bb is (1e \u2212 5, 1e \u2212 6, . . . , 1e \u2212 10).\nData sets and Resampling We use two datasets, CPUsmall and Kin8nm, for regression; and USPS\nhandwritten digits for classi\ufb01cation. And we draw the \ufb01rst 500 or 1000 points from the original\ndata set as X p. To obtain X q, the following two ways of resampling, using the features or the label\ninformation, are used (along the lines of those in [6]).\nGiven a set of data with labels {(x1, y1), (x2, y2), . . . , (xn, yn)} and denoting Pi the probability of\ni\u2019th instance being chosen, we resample as follows:\n(1) Resampling using features (labels yi are not used). Pi = e(ahxi,e1i\u2212b)/\u03c3v\n1+e(ahxi,e1i\u2212b)/\u03c3v , where a, b are the\nresampling parameters, e1 is the \ufb01rst principal component, and \u03c3v is the standard deviation of the\nprojections to e1. This resampling method will be denoted by PCA(a, b).\nwhere yi \u2208 L = {1, 2, . . . , k} and Lq is a\n(2) Resampling using labels. Pi =\nsubset of the whole label set L. It only applies to binary problems obtained by aggregating different\nclasses in multi-class setting.\nTesting the FIRE algorithm. In the \ufb01rst experiment, we test our method for selecting parameters\nby focusing on the error JCD(f; X p, X q, U) in Eq. 16 for different function classes U. 
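To make the criterion concrete, the following minimal NumPy sketch (our own illustration, with the 1/n and 1/m sample normalizations written out and f_p denoting the estimated ratio evaluated on X p) computes JCD with F random linear test functions:

import numpy as np

def cd_cv_score(f_p, Xp, Xq, F=50, seed=0):
    # J_CD of Eq. 16 with random linear test functions u_l(x) = beta_l^T x,
    # beta_l ~ N(0, I_d); f_p[i] holds the estimated ratio q/p at x_i^p
    rng = np.random.default_rng(seed)
    n, d = Xp.shape
    m = Xq.shape[0]
    B = rng.standard_normal((F, d))      # one test function per row
    Up = B @ Xp.T                        # u_l(x_i^p), shape (F, n)
    Uq = B @ Xq.T                        # u_l(x_j^q), shape (F, m)
    diffs = (Up * f_p).sum(axis=1) / n - Uq.sum(axis=1) / m
    return np.mean(diffs ** 2)

Random half-space test functions are obtained by replacing B @ Xp.T and B @ Xq.T with the indicators (B @ Xp.T > 0) and (B @ Xq.T > 0).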
Parameters\nare chosen using a family of functions U1, while the performance of the parameter is measured using\nan independent function family U2. This measure is important because in practice the functions we\nare interested in may not be the ones chosen for validation.\nWe use the USPS data sets for this experiment. As a basis for comparison we use TIKDE (Thresh-\nolded Inverse Kernel Density Estimator). TIKDE estimates \u02c6p and \u02c6q respectively using Kernel Den-\nsity Estimation (KDE), and assigns \u02c6p(x) = \u03b1 to any x satisfying \u02c6p(x) < \u03b1. TIKDE then outputs\n\u02c6q/\u02c6p. We note that chosen threshold \u03b1 is key to reasonable performance. One issue of this heuristic\nis that it could underestimate at the region with high density ratio, due to the uniform thresholding.\nWe also compare our methods to LSIF [10]. In these experiments we do not compare with KMM as\nout-of-sample extension is necessary for fair comparison.\nTable 1 shows the average errors of various methods, de\ufb01ned in Eq. 16 on held-out set Xerr over 5\ntrials. We use different validation functions f cv(Columns) and error-measuring functions f err(Row).\nN is the number of random functions used for validation. The error-measuring function families U2\nare as follows: (1) Linear(L.): random linear functions f(x) = \u03b2T x where \u03b2 \u223c N(0, Id); (2) Half-\n\n(cid:26)1 yi \u2208 Lq\n\n0 Otherwise.\n\n7\n\n\fspace(H.S.): Sets of random half-space indicator functions, f(x) = 1\u03b2T x; (3) Kernel(K.): random\nlinear combinations of kernel functions centered at training data, f(x) = \u03b3T K where \u03b3 \u223c N(0, Id)\nand Kij = k(xi, xj) for xi from training set; (4) Kernel indicator(K.I.) functions f(x) = 1g(x)>0,\nwhere g is as in (3).\nTable 1: USPS data set with resampling using PCA(5, \u03c3v) with |X p| = 500, |X q| = 1371. Around\n400 in X p and 700 in X q are used in 5-fold CV, the rest are held-out for computing the error.\n\nN\nTIKDE\nLSIF\nFIREp\nFIREp,q\nFIREq\nTIKDE\nLSIF\nFIREp\nFIREp,q\nFIREq\n\nL.\n\nH.S.\n\nLinear\n50\n10.9\n14.1\n3.6\n4.7\n5.9\n2.6\n3.9\n1.0\n0.9\n1.2\n\n200\n10.9\n14.1\n3.7\n4.7\n6.2\n2.6\n3.9\n0.9\n1.0\n1.4\n\nHalf-Spaces\n50\n200\n10.9\n10.9\n28.2\n26.8\n6.3\n5.5\n6.8\n7.4\n9.3\n9.3\n2.6\n2.6\n3.9\n3.7\n1.2\n1.0\n1.4\n1.1\n1.6\n1.6\n\nN\nTIKDE\nLSIF\nFIREp\nFIREp,q\nFIREq\nTIKDE\nLSIF\nFIREp\nFIREp,q\nFIREq\n\nK.\n\nK.I.\n\nLinear\n50\n4.7\n16.1\n1.2\n2.1\n5.2\n4.2\n4.4\n0.9\n0.6\n1.2\n\n200\n4.7\n16.1\n1.1\n2.0\n4.3\n4.2\n4.4\n0.7\n0.6\n0.9\n\nHalf-Spaces\n50\n200\n4.7\n4.7\n13.8\n15.6\n3.6\n2.8\n2.6\n4.2\n6.1\n6.1\n4.2\n4.2\n4.4\n5.3\n1.1\n1.2\n1.9\n1.1\n2.2\n2.2\n\nSupervised Learning: Regression and Classi\ufb01cation. We compare our FIRE algorithm with sev-\neral other methods in regression and classi\ufb01cation tasks. We consider the situation where part of\nthe data set X p are labeled and all of X q are unlabeled. We use weighted ordinary least-square for\nregression and weighted linear SVM for classi\ufb01cation.\nRegression. Square loss function is used for regression. The performance is measured using nor-\n(\u02c6yi\u2212yi)2\nVar(\u02c6y\u2212y) . X q is resampled using PCA resampler, described before. L is\nfor Linear, and HS is for Half-Space function families for parameter selection.\nTable 2: Mean normalized square loss on the CPUsmall and Kin8nm. |X p| = 1000, |X q| = 2000.\n\nmalized square loss,Pn\n\ni=1\n\nNo. 
of Labeled\n\nWeights\n\nOLS\nTIKDE\nKMM\nLSIF\nFIREp\nFIREp,q\nFIREq\n\nCPUsmall, resampled by PCA(5, \u03c3v)\n\nKin8nm, resampled by PCA(1, \u03c3v)\n\n100\n\n200\n\n500\n\n100\n\n200\n\n500\n\nHS\n\nL\n\nL\n\n.74\n\nHS\n\n.50\n\n.38\n1.86\n.39\n.33\n.33\n.32\n\n.36\n1.86\n.39\n.33\n.33\n.33\n\n.30\n1.9\n.31\n.29\n.29\n.28\n\n.29\n1.9\n.31\n.29\n.29\n.29\n\nL\n\n.28\n2.5\n.33\n.27\n.27\n.27\n\nHS\n\n.83\n\n.28\n2.5\n.33\n.27\n.27\n.27\n\nL\n\n.57\n.58\n.57\n.57\n.56\n.56\n\nHS\n\n.59\n\n.57\n.58\n.56\n.56\n.56\n.56\n\nL\n\n.55\n.55\n.54\n.55\n.55\n.55\n\nHS\n\n.55\n\n.55\n.55\n.54\n.54\n.54\n.54\n\nL\n\n.53\n.52\n.52\n.52\n.52\n.52\n\nHS\n\n0.54\n\n.53\n.52\n.52\n.52\n.52\n.52\n\nClassi\ufb01cation. Weighted linear SVM. Percentage of incorrectly labeled test set instances.\nTable 3: Average error on USPS with +1 class= {0 \u2212 4}, \u22121 class= {5 \u2212 9} and |X p| = 1000\nand |X q| = 2000. Left half of the table uses resampling PCA(5, \u03c3v), where \u03c3v. Right half shows\nresampling based on Label information.\nL = {{0 \u2212 4},{5 \u2212 9}},L0 = {0, 1, 5, 6}\nPCA(5, \u03c3v)\n\nNo. of Labeled\n\n100\n\n200\n\n500\n\n100\n\n200\n\n500\n\nWeights\nSVM\nTIKDE\nKMM\nLSIF\nFIREp\nFIREp,q\nFIREq\n\nL\n\nHS\n\n10.2\n\n9.4\n8.1\n9.5\n8.9\n7.0\n5.5\n\n9.4\n8.1\n10.2\n6.8\n7.0\n7.3\n\nL\n\n7.2\n5.9\n7.3\n5.3\n5.1\n4.8\n\nHS\n\n8.1\n\n7.2\n5.9\n8.1\n5.0\n5.1\n5.4\n\nL\n\n4.9\n4.7\n5.0\n4.1\n4.1\n4.1\n\nHS\n\n5.7\n\nL\n\nHS\n\n18.6\n\nL\n\nHS\n\n16.4\n\n4.9\n4.7\n5.7\n4.1\n4.1\n4.4\n\n18.5\n17.5\n18.5\n17.9\n18.0\n18.3\n\n18.5\n17.5\n18.5\n18.4\n18.5\n18.4\n\n16.4\n13.5\n16.2\n16.1\n16.1\n16.0\n\n16.4\n13.5\n16.3\n16.1\n16.2\n16.2\n\nL\n\n12.4\n10.3\n12.2\n11.5\n11.6\n11.8\n\nHS\n\n12.9\n\n12.4\n10.3\n12.2\n12.0\n12.0\n12.0\n\nAcknowledgements. The work was partially supported by NSF Grants IIS 0643916, IIS 1117707.\n\n8\n\n\fReferences\n[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for\n\nlearning from labeled and unlabeled examples. JMLR, 7:2399\u20132434, 2006.\n\n[2] S. Bickel, M. Br\u00a8uckner, and T. Scheffer. Discriminative learning for differing training and test\n\ndistributions. In ICML, 2007.\n\n[3] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from\n\nexamples as an inverse problem. JMLR, 6:883, 2006.\n\n[4] E. De Vito, L. Rosasco, and A. Toigo. Spectral regularization for support estimation. In NIPS,\n\npages 487\u2013495, 2010.\n\n[5] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems. Springer, 1996.\n[6] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Sch\u00a8olkopf. Covariate\n\nshift by kernel mean matching. Dataset shift in machine learning, pages 131\u2013160, 2009.\n\n[7] S. Gr\u00a8unew\u00a8alder, A. Gretton, and J. Shawe-Taylor. Smooth operators. In ICML, 2013.\n[8] S. Gr\u00a8unew\u00a8alder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional\n\nmean embeddings as regressors. In ICML, 2012.\n\n[9] J. Huang, A. Gretton, K. M. Borgwardt, B. Sch\u00a8olkopf, and A. Smola. Correcting sample\n\nselection bias by unlabeled data. In NIPS, pages 601\u2013608, 2006.\n\n[10] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance esti-\n\nmation. JMLR, 10:1391\u20131445, 2009.\n\n[11] J. S. Kim and C. Scott. Robust kernel density estimation. In ICASSP, pages 3381\u20133384, 2008.\n[12] S. Mukherjee and V. Vapnik. Support vector method for multivariate density estimation. 
In Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, MIT, CBCL volume 170, 1999.
[13] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. NIPS, 20:1089–1096, 2008.
[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[16] B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2001.
[17] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[18] T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, 37(6B):3960–3984, 2009.
[19] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[20] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica, 22(1):211–231, 1998.
[21] I. Steinwart and A. Christmann. Support vector machines. Springer, 2008.
[22] M. Sugiyama, M. Krauledat, and K. Müller. Covariate shift adaptation by importance weighted cross validation. JMLR, 8:985–1005, 2007.
[23] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS, 20:1433–1440, 2008.
[24] A. Tsybakov. Introduction to nonparametric estimation. Springer, 2009.
[25] C. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. In ICML, 2000.
[26] Y. Yu and C. Szepesvári. Analysis of kernel mean matching under covariate shift. In ICML, 2012.
[27] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In ICML, 2004.
", "award": [], "sourceid": 743, "authors": [{"given_name": "Qichao", "family_name": "Que", "institution": "Ohio State University"}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": "Ohio State University"}]}