{"title": "Learning with Consistency between Inductive Functions and Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1849, "page_last": 1856, "abstract": "Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on a constant function. On the other hand, while kernel-based algorithms have been developed in such a tendency that almost all learning algorithms are kernelized or being kernelized, a basic fact is often ignored: The learned function from the data and the kernel fits the data well, but may not be consistent with the kernel. Based on these considerations and on the intuition that a good kernel-based inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions.", "full_text": "Learning with Consistency between Inductive\n\nFunctions and Kernels\n\nHaixuan Yang1,2\n\nIrwin King1\n\nMichael R. 
Lyu1\n\n1Department of Computer Science & Engineering\n\nThe Chinese University of Hong Kong\n\n{hxyang,king,lyu}@cse.cuhk.edu.hk\n\n2Department of Computer Science\n\nRoyal Holloway University of London\n\nhaixuan@cs.rhul.ac.hk\n\nAbstract\n\nRegularized Least Squares (RLS) algorithms have the ability to avoid over-\ufb01tting\nproblems and to express solutions as kernel expansions. However, we observe\nthat the current RLS algorithms cannot provide a satisfactory interpretation even\non the penalty of a constant function. Based on the intuition that a good kernel-\nbased inductive function should be consistent with both the data and the kernel, a\nnovel learning scheme is proposed. The advantages of this scheme lie in its cor-\nresponding Representer Theorem, its strong interpretation ability about what kind\nof functions should not be penalized, and its promising accuracy improvements\nshown in a number of experiments. Furthermore, we provide a detailed techni-\ncal description about heat kernels, which serves as an example for the readers to\napply similar techniques for other kernels. Our work provides a preliminary step\nin a new direction to explore the varying consistency between inductive functions\nand kernels under various distributions.\n\n1 Introduction\n\nRegularized Least Squares (RLS) algorithms have been drawing people\u2019s attention since they were\nproposed due to their ability to avoid over-\ufb01tting problems and to express solutions as kernel ex-\npansions in terms of the training data [4, 9, 12, 13]. Various modi\ufb01cations of RLS are made to\nimprove its performance either from the viewpoint of manifold [1] or in a more generalized form\n[7, 11]. However, despite these modi\ufb01cations, problems still remain. We observe that the previous\nRLS-related work has the following problem:\nOver Penalization. For a constant function f = c, a nonzero term ||f ||K is penalized in both\nRLS and LapRLS [1]. 
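The over-penalization is easy to reproduce with plain RLS (kernel ridge regression). A minimal sketch, using the standard RLS closed form alpha = (K + gamma*l*I)^{-1} y; the data and parameter values are illustrative assumptions modeled on the toy setting of Fig. 1, not the paper's exact experiment:

```python
import numpy as np

# Minimal reproduction of the over-penalization effect with plain RLS
# (kernel ridge regression).  Data and parameters are illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
y = 1.0 + 0.005 * rng.standard_normal(20)    # a (nearly) constant target

sigma = 0.2
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma**2))

gamma, l = 0.1, len(x)
# Standard RLS closed form: alpha = (K + gamma*l*I)^{-1} y
alpha = np.linalg.solve(K + gamma * l * np.eye(l), y)
f_hat = K @ alpha

# Because ||f||_K penalizes the constant component itself, the fitted
# values are shrunk away from the true constant 1.
print(f_hat.mean())
```

The mean of the fitted values comes out strictly below 1: the constant function is paying a penalty it arguably should not.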
As a result, for a distribution generated by a nonzero constant function, the resulting regression function produced by both RLS and LapRLS is not a constant, as illustrated in the left diagram in Fig. 1. For such situations, there is an over-penalization.

In this work, we aim to provide a new viewpoint for supervised or semi-supervised learning problems. By such a viewpoint we can provide a general condition under which constant functions should not be penalized. The basic idea is that, if a learning algorithm can learn an inductive function f(x) from examples generated by a joint probability distribution P on X × R, then the learned function f(x) and the marginal P_X represent a new distribution on X × R, from which there is a re-learned function r(x). The re-learned function should be consistent with the learned function in the sense that the expected difference on distribution P_X is small. Because the re-learned function depends on the underlying kernel, the difference f(x) − r(x) depends on f(x) and the kernel, and from this point of view, we name this work.

[Figure 1 consists of three panels. Left panel "RLS": the ideal function, the labeled data, and RLS fits for several values of γ. Middle panel "The Re-learned function and the Residual": curves f(x), r(x), and f(x) − r(x). Right panel "RLS vs PRLS": RLS with γ = 0.005 against PRLS with γ = 1000, 1, 0.001, and 0.]

Figure 1: Illustration for over-penalization. Left diagram: The training set contains 20 points, whose x is randomly drawn from the interval [0, 1], whereas the test set contains another 20 points, and y is generated by 1 + 0.005ε, ε ∼ N(0, 1). 
The over-penalized constant functions in the term ||f||_K cause the phenomenon that smaller γ achieves better results. On the other hand, the over-fitting when γ = 0 shows the necessity of the regularization term. Based on these observations, an appropriate penalization of a function is expected. Middle diagram: r(x) is very smooth, and f(x) − r(x) retains the uneven part of f(x); therefore f(x) − r(x) should be penalized, while f is over-penalized in ||f||_K. Right diagram: the proposed model has a stability property, so that a large variation of γ results in only small changes of the curves, suggesting a right way of penalizing functions.

2 Background

The RKHS theory enables us to express solutions of RLS as kernel expansions in terms of the training data. Here we give a brief description of the concepts; for a complete discussion, see [2]. Let X be a compact domain or manifold, ν be a Borel measure on X, and K : X × X → R be a Mercer kernel; then there is an associated Hilbert space (RKHS) H_K of functions X → R with the corresponding norm ||·||_K. H_K satisfies the reproducing property, i.e., for all f ∈ H_K, f(x) = ⟨K_x, f⟩, where K_x is the function K(x, ·). Moreover, an operator L_K can be defined on H_K as (L_K f)(x) = ∫_X f(y)K(x, y) dν(y), where L²_ν(X) is the Hilbert space of square-integrable functions on X with the scalar product ⟨f, g⟩_ν = ∫_X f(x)g(x) dν(x).

Given a Mercer kernel and a set of labeled examples (x_i, y_i) (i = 1, ..., l), there are two popular inductive learning algorithms: RLS [12, 13] and the Nadaraya-Watson Formula [5, 8, 14]. 
By the standard Tikhonov regularization, RLS is a special case of the following functional extremum problem:

f* = arg min_{f ∈ H_K} (1/l) Σ_{i=1}^l V(x_i, y_i, f) + γ||f||²_K,    (1)

where V is some loss function.
The Classical Representer Theorem states that the solution to this minimization problem exists in H_K and can be written as

f*(x) = Σ_{i=1}^l α_i K(x_i, x).    (2)

Such a Representer Theorem is general because it plays an important role both in RLS, in the case V(x, y, f) = (y − f(x))², and in SVM, in the case V(x, y, f) = max(0, 1 − yf(x)).

The Nadaraya-Watson Formula is based on local weighted averaging, and it comes with a closed form:

r(x) = Σ_{i=1}^l y_i K(x, x_i) / Σ_{i=1}^l K(x, x_i).    (3)

The formula has a similar appearance to Eq. (2), but it plays an important role in this paper because we can write it in an integral form, which makes our idea technically feasible, as follows. Let p(x) be a probability density function over X, P(x) be the corresponding cumulative distribution function, and f(x) be an inductive function. We observe that, if (x_i, f(x_i)) (i = 1, 2, ..., l) are sampled from the function y = f(x), then
A Re-learned Function can be expressed as

r(x) = lim_{l→∞} Σ_{i=1}^l f(x_i)K(x, x_i) / Σ_{i=1}^l K(x, x_i) = ∫_X f(α)K(x, α) dP(α) / ∫_X K(x, α) dP(α) = L_K(f) / ∫_X K(x, α) dP(α),    (4)

based on f(x) and P(x). From this form, we show two points: (1) If r(x) = f(x), then f(x) is completely predicted by itself through the Nadaraya-Watson Formula, and so f(x) is considered to be completely consistent with the kernel K(x, y); if r(x) ≠ f(x), then the difference ||f(x) − r(x)||_K can measure how inconsistent f(x) is with the kernel K(x, y). (2) Intuitively, r(x) can also be understood as the smoothed function of f(x) through a kernel K. 
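The smoothing view of the re-learned function can be sketched numerically. A minimal sketch of Eq. (3)/(4), assuming one-dimensional data, a Gaussian heat-kernel form with illustrative bandwidth, and illustrative test functions (the normalizing prefactor of the kernel cancels in the ratio, so it is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.uniform(0.0, 1.0, size=500)   # samples from the marginal P_X

def relearn(f, x, t=0.05):
    """Re-learned function of Eq. (3):
    r(x) = sum_i f(x_i) K(x, x_i) / sum_i K(x, x_i),
    with K_t(x, y) proportional to exp(-(x - y)^2 / (4t))."""
    K = np.exp(-((x[:, None] - xi[None, :]) ** 2) / (4 * t))
    return (K @ f(xi)) / K.sum(axis=1)

xs = np.linspace(0.1, 0.9, 9)
# A constant function is reproduced exactly: r(x) = c, so f - r = 0 and
# constants incur no penalty under the proposed scheme.
print(relearn(lambda z: np.full_like(z, 2.5), xs))
# An oscillatory f is smoothed out; the residual f - r keeps the
# high-frequency component, which is what ||f - L_K(f)||_K penalizes.
f = lambda z: 1.0 + 0.2 * np.sin(40.0 * z)
print(np.abs(f(xs) - relearn(f, xs)).max())
```

In this sketch the constant is reproduced exactly, while the residual f − r isolates the oscillatory component of f.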
Consequently, f(x) − r(x) represents the intrinsically uneven part of f(x), which we will penalize. This intuition is illustrated in the middle diagram in Fig. 1.

Throughout this paper, we assume that ∫_X K(x, α) dP(α) is a constant, and for simplicity all kernels are normalized by K/∫_X K(x, α) dP(α), so that r(x) = L_K(f). Moreover, we assume that X is compact, and the measure ν is specified as P(x).

3 Partially-penalized Regularization

For a given kernel K and an inductive function f, L_K(f) is the prediction function produced by K through the Nadaraya-Watson Formula. Based on Eq. (1), penalizing the inconsistent part f(x) − L_K(f) leads to the following Partially-penalized Regularization problem:

f* = arg min_{f ∈ H_K} (1/l) Σ_{i=1}^l V(x_i, y_i, f) + γ||f − L_K(f)||²_K.    (5)

To obtain a Representer Theorem, we need one assumption.

Assumption 1 Let f_1, f_2 ∈ H_K. If ⟨f_1, f_2⟩_K = 0, then ||f_1 − L_K(f_1) + f_2 − L_K(f_2)||²_K = ||f_1 − L_K(f_1)||²_K + ||f_2 − L_K(f_2)||²_K.

It is well known that the operator L_K is compact, self-adjoint, and positive with respect to L²_ν(X), and by the Spectral Theorem [2, 3], its eigenfunctions e_1(x), e_2(x), ... form an orthogonal basis of L²_ν(X), and the corresponding eigenvalues λ_1 ≥ λ_2 ≥ ... are either finitely many that are nonzero, or there are infinitely many, in which case λ_k → 0. Let f_1 = Σ_i a_i e_i(x) and f_2 = Σ_i b_i e_i(x); then f_1 − L_K(f_1) = Σ_i a_i e_i(x) − L_K(Σ_i a_i e_i(x)) = Σ_i a_i e_i(x) − Σ_i λ_i a_i e_i(x) = Σ_i (1 − λ_i) a_i e_i(x), and similarly f_2 − L_K(f_2) = Σ_i (1 − λ_i) b_i e_i(x). By the discussions in [1], we have ⟨e_i, e_j⟩_ν = 0 if i ≠ j, and ⟨e_i, e_i⟩_ν = 1; ⟨e_i, e_j⟩_K = 0 if i ≠ j, and ⟨e_i, e_i⟩_K = 1/λ_i. If we consider the situation that a_i, b_i ≥ 0 for all i ≥ 1, then ⟨f_1, f_2⟩_K = 0 implies that a_i b_i = 0 for all i ≥ 1, and consequently ⟨f_1 − L_K(f_1), f_2 − L_K(f_2)⟩_K = Σ_i (1 − λ_i)² a_i b_i ⟨e_i(x), e_i(x)⟩_K = 0. Therefore, under some constraints, this assumption is a fact. Under this assumption, we have a Representer Theorem.

Theorem 2 Let µ_j(x) be a basis of the null space H_0 of the operator I − L_K, i.e., H_0 = {f ∈ H_K | f − L_K(f) = 0}. Under Assumption 1, the minimizer of the optimization problem in Eq. (5) is

f*(x) = Σ_{j=1}^o β_j µ_j(x) + Σ_{i=1}^l α_i K(x_i, x).    (6)

Proof of the Representer Theorem. Any function f ∈ H_K can be uniquely decomposed into a component f_∥ in the linear subspace spanned by the kernel functions {K(x_i, ·)}_{i=1}^l, and a component f_⊥ orthogonal to it. Thus f = f_∥ + f_⊥ = Σ_{i=1}^l α_i K(x_i, ·) + f_⊥. By the reproducing property and the fact that ⟨f_⊥, K(x_i, ·)⟩ = 0 for 1 ≤ i ≤ l, we have

f(x_j) = ⟨f, K(x_j, ·)⟩ = ⟨Σ_{i=1}^l α_i K(x_i, ·), K(x_j, ·)⟩ + ⟨f_⊥, K(x_j, ·)⟩ = ⟨Σ_{i=1}^l α_i K(x_i, ·), K(x_j, ·)⟩.

Thus the empirical terms involving the loss function in Eq. (5) depend only on the values of the coefficients {α_i}_{i=1}^l and the Gram matrix of the kernel function. By Assumption 1, we have

||f − L_K(f)||²_K = ||Σ_{i=1}^l α_i K(x_i, ·) − L_K(Σ_{i=1}^l α_i K(x_i, ·))||²_K + ||f_⊥ − L_K(f_⊥)||²_K ≥ ||Σ_{i=1}^l α_i K(x_i, ·) − L_K(Σ_{i=1}^l α_i K(x_i, ·))||²_K.

It follows that the minimizer of Eq. 
(5) must have ||f_⊥ − L_K(f_⊥)||²_K = 0, so that f_⊥ ∈ H_0, and therefore admits a representation f*(x) = f_⊥ + Σ_{i=1}^l α_i K(x_i, x) = Σ_{j=1}^o β_j µ_j(x) + Σ_{i=1}^l α_i K(x_i, x).

3.1 Partially-penalized Regularized Least Squares (PRLS) Algorithm

In this section, we focus our attention on the case that V(x_i, y_i, f) = (y_i − f(x_i))², i.e., the Regularized Least Squares algorithm. In our setting, we aim to solve:

min_{f ∈ H_K} (1/l) Σ_i (y_i − f(x_i))² + γ||f − L_K(f)||²_K.    (7)

By the Representer Theorem, the solution to Eq. (7) is of the following form:

f*(x) = Σ_{j=1}^o β_j µ_j(x) + Σ_{i=1}^l α_i K(x_i, x).    (8)

By the proof of Theorem 2, we have f_⊥ = Σ_{j=1}^o β_j µ_j(x) and ⟨f_⊥, Σ_{i=1}^l α_i K(x_i, x)⟩_K = 0. By Assumption 1 and the fact that f_⊥ belongs to the null space H_0 of the operator I − L_K, we have

||f* − L_K(f*)||²_K = ||f_⊥ − L_K(f_⊥)||²_K + ||Σ_{i=1}^l α_i K(x_i, x) − L_K(Σ_{i=1}^l α_i K(x_i, x))||²_K = ||Σ_{i=1}^l α_i K(x_i, x) − Σ_{i=1}^l α_i L_K(K(x_i, x))||²_K = αᵀ(K − 2K′ + K″)α,    (9)

where α = [α_1, α_2, ..., α_l]ᵀ, K is the l × l Gram matrix K_ij = K(x_i, x_j), and K′ and K″ are reconstructed l × l matrices K′_ij = ⟨K(x_i, x), L_K(K(x_j, x))⟩_K and K″_ij = ⟨L_K(K(x_i, x)), L_K(K(x_j, x))⟩_K. Substituting Eq. (8) and Eq. (9) into the problem in Eq. (7), we arrive at the following quadratic objective function of the l-dimensional variable α and the o-dimensional variable β = [β_1, β_2, ..., β_o]ᵀ:

[α*, β*] = arg min (1/l)(Y − Kα − Ψβ)ᵀ(Y − Kα − Ψβ) + γαᵀ(K − 2K′ + K″)α,    (10)

where Ψ is an l × o matrix Ψ_ij = µ_j(x_i), and Y = [y_1, y_2, ..., y_l]ᵀ. Taking derivatives with respect to α and β, since the derivative of the objective function vanishes at the minimizer, we obtain

(γl(K − 2K′ + K″) + K²)α + KΨβ = KY,  Ψᵀ(Y − Kα − Ψβ) = 0.    (11)

In the term ||f − L_K(f)||_K, L_K(f) is subtracted from f, and so f is only partially penalized. For this reason, the resulting algorithm is referred to as the Partially-penalized Regularized Least Squares (PRLS) algorithm.

3.2 The PLapRLS Algorithm

The idea in the previous section can also be extended to LapRLS in the manifold regularization framework [1]. In the manifold setting, the smoothness on the data adjacency graph should be considered, and Eq. (5) is modified as

f* = arg min_{f ∈ H_K} (1/l) Σ_{i=1}^l V(x_i, y_i, f) + γ_A ||f − L_K(f)||²_K + (γ_I/(u + l)²) Σ_{i,j=1}^{l+u} (f(x_i) − f(x_j))² W_ij,    (12)

where W_ij are edge weights in the data adjacency graph. From W, the graph Laplacian L is given by L = D − W, where D is the diagonal matrix with D_ii = Σ_{j=1}^{l+u} W_ij. For this optimization problem, the result in Theorem 2 can be modified slightly as:

Theorem 3 Under Assumption 1, the minimizer of the optimization problem in Eq. (12) admits an expansion

f*(x) = Σ_{j=1}^o β_j µ_j(x) + Σ_{i=1}^{l+u} α_i K(x_i, x).    (13)

Following Eq. (13), we continue to optimize the (l + u)-dimensional variable α = [α_1, α_2, ..., α_{l+u}]ᵀ and the o-dimensional variable β = [β_1, β_2, . . . 
, β_o]ᵀ. In a similar way as in the previous section and LapRLS in [1], α and β are determined by the following linear systems:

(KJK + λ_1(K − 2K′ + K″) + λ_2 KLK)α + (KJΨ + λ_2 KLΨ)β = KJY,
(ΨᵀJK − λ_2 ΨᵀLK)α + (ΨᵀΨ − λ_2 ΨᵀLΨ)β = ΨᵀY,    (14)

where K, K′, K″ are the (l + u) × (l + u) Gram matrices over labeled and unlabeled points; Y is an (l + u)-dimensional label vector given by Y = [y_1, y_2, ..., y_l, 0, ..., 0]; J is an (l + u) × (l + u) diagonal matrix given by J = diag(1, 1, ..., 1, 0, ..., 0) with the first l diagonal entries 1 and the rest 0; and Ψ is an (l + u) × o matrix Ψ_ij = µ_j(x_i).

4 Discussions

4.1 Heat Kernels and the Computation of K′ and K″

In this section we illustrate the computation of K′ and K″ in the case of heat kernels. The basic facts about heat kernels are excerpted from [6]; for more material, see [10].
Given a manifold M and points x and y, the heat kernel K_t(x, y) is a special solution to the heat equation with a special initial condition called the delta function δ(x − y). More specifically, δ(x − y) describes a unit heat source at position y with no heat at other positions. Namely, δ(x − y) = 0 for x ≠ y and ∫_{−∞}^{+∞} δ(x − y) dx = 1. If we let f_0(x) = δ(x − y), then K_t(x, y) is a solution to the following differential equation on a manifold M:

∂f/∂t − Lf = 0,  f(x, 0) = f_0(x),    (15)

where f(x, t) is the temperature at location x at time t, beginning with an initial distribution f_0(x) at time zero, and L is the Laplace-Beltrami operator. 
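As a quick sanity check of Eq. (15) in the familiar Euclidean case, the following sketch verifies by finite differences that the one-dimensional Gaussian form of the heat kernel solves the heat equation; the evaluation point and step size are illustrative choices:

```python
import numpy as np

# Finite-difference check that K_t(x, y) = (4*pi*t)^(-1/2) exp(-(x-y)^2/(4t))
# on M = R solves Eq. (15) with L = d^2/dx^2.
def Kt(t, x, y=0.0):
    return (4.0 * np.pi * t) ** -0.5 * np.exp(-((x - y) ** 2) / (4.0 * t))

t0, x0, h = 0.3, 0.7, 1e-3
df_dt = (Kt(t0 + h, x0) - Kt(t0 - h, x0)) / (2 * h)                   # df/dt
d2f_dx2 = (Kt(t0, x0 + h) - 2 * Kt(t0, x0) + Kt(t0, x0 - h)) / h**2   # Lf
print(df_dt - d2f_dx2)   # ~0: df/dt - Lf = 0 holds up to discretization error
```

Both derivatives are individually far from zero, yet their difference vanishes to within the finite-difference error.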
Equation (15) describes the heat flow throughout a geometric manifold with initial conditions.

Theorem 4 Let M be a complete Riemannian manifold. Then there exists a function K ∈ C^∞(R⁺ × M × M), called the heat kernel, which satisfies the following properties for all x, y ∈ M, with K_t(x, y) = K(t, x, y): (1) K_t(x, y) defines a Mercer kernel. (2) K_t(x, y) = ∫_M K_{t−s}(x, z) K_s(z, y) dz for any s > 0. (3) The solution to Eq. (15) is f(x, t) = ∫_M K_t(x, y) f_0(y) dy. (4) 1 = ∫_M K_t(x, y) · 1 dy. (5) When M = R^m, Lf simplifies to Σ_i ∂²f/∂x_i², and the heat kernel takes the Gaussian RBF form K_t(x, y) = (4πt)^{−m/2} e^{−||x−y||²/(4t)}.

K′ and K″ can be computed as follows:

K′_ij = ⟨K_t(x_i, x), L_K(K_t(x_j, x))⟩_K  (by definition)
      = L_K(K_t(x_j, x))|_{x=x_i}  (by the reproducing property of a Mercer kernel)
      = ∫_X K_t(x_j, y) K_t(x_i, y) dν(y)  (by the definition of L_K)
      = K_{2t}(x_i, x_j)  (by Property 2 in Theorem 4).    (16)

Based on the fact that L_K is self-adjoint, we can similarly derive K″_ij = K_{3t}(x_i, x_j). For other kernels, K′ and K″ can also be computed.

4.2 What should not be penalized?

From Theorem 2, we know that the functions in the null space H_0 = {f ∈ H_K | f − L_K(f) = 0} should not be penalized. Although there may be looser assumptions that can guarantee the validity of the result in Theorem 2, there are two assumptions in this work: X is compact, and ∫_X K(x, α) dP(α) in Eq. (4) is a constant. Next we discuss constant functions and linear functions.
Should constant functions be penalized? Under the two assumptions, a constant function c should not be penalized, because c = ∫_X cK(x, α)p(α) dα / ∫_X K(x, α)p(α) dα, i.e., c ∈ H_0. 
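The identity K′_ij = K_{2t}(x_i, x_j) from Eq. (16) can be checked numerically for the Gaussian form on M = R (m = 1), where the integral over X becomes an integral over the real line; the grid and evaluation points below are illustrative:

```python
import numpy as np

# Numerical check of Eq. (16): for the Gaussian heat kernel on R,
#   K'_ij = integral of K_t(x_i, y) K_t(x_j, y) dy = K_{2t}(x_i, x_j),
# by the semigroup Property 2 of Theorem 4 (similarly K''_ij = K_{3t}).
def Kt(t, x, y):
    return (4.0 * np.pi * t) ** -0.5 * np.exp(-((x - y) ** 2) / (4.0 * t))

t, xi, xj = 0.1, 0.3, 1.2
y = np.linspace(-15.0, 15.0, 300001)       # fine grid; integrand decays fast
dy = y[1] - y[0]
lhs = np.sum(Kt(t, xi, y) * Kt(t, xj, y)) * dy   # quadrature for K'_ij
rhs = Kt(2.0 * t, xi, xj)                        # K_{2t}(x_i, x_j)
print(lhs, rhs)   # agree to high precision
```

The same style of check applies to K″_ij against K_{3t}(x_i, x_j).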
For heat kernels, if P(x) is uniformly distributed on M, then by Property 4 in Theorem 4, ∫_X K(x, α) dP(α) is a constant, and so c should not be penalized.
For polynomial kernels, the theory cannot guarantee that constant functions should not be penalized, even with a uniform distribution P(x). For example, considering the polynomial kernel xy + 1 on the interval X = [0, 1] with the uniform distribution on X, ∫_X (xy + 1) dP(y) = ∫_0^1 (xy + 1) dy = x/2 + 1 is not a constant. As a counter example, we will show in Section 5.3 that not penalizing constant functions in polynomial kernels results in much worse accuracy. The reason for this phenomenon is that constant functions may not be smooth in the feature space produced by the polynomial kernel under some distributions. The readers can deduce an example of p(x) such that ∫_0^1 (xy + 1) dP(y) happens to be a constant.
Should the linear function aᵀx be penalized? In the case when X is a closed ball B_r with radius r, when P(x) is uniformly distributed over B_r, and when K is the Gaussian RBF kernel, then aᵀx should not be penalized when r is big enough.¹ Since r is big enough, we have ∫_{R^n} · dx ≈ ∫_{B_r} · dx and ∫_{B_r} K_t(x, y) dy ≈ 1, and so aᵀx = ∫_{R^n} K_t(x, y) aᵀy dy ≈ ∫_{B_r} K_t(x, y) aᵀy dy ≈ L_K(aᵀx). Consequently ||aᵀx − L_K(aᵀx)||_K will be small enough, and so the linear function aᵀx need not be penalized. For other kernels, other spaces, or other P_X, the conclusion may not be true.

5 Experiments

In this section, we evaluate the proposed algorithms PRLS and PLapRLS on a toy dataset (size: 40), a medium-sized dataset (size: 3,119), and a large-sized dataset (size: 20,000), and provide a counter example for constant functions on another dataset (size: 9,298). 
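Putting Sections 3.1 and 4.1 together, a PRLS fit can be sketched end-to-end. This is a minimal sketch under several stated assumptions: one-dimensional toy data like that of Fig. 1 (values are illustrative, not the paper's exact settings); the 1-D Gaussian heat-kernel form of Theorem 4(5), so K′ = K_{2t} and K″ = K_{3t}; and H_0 spanned by the constant function, so Ψ is a single all-ones column (the theory also assumes the kernel is normalized so that ∫_X K dP is constant, which holds only approximately on [0, 1] here):

```python
import numpy as np

def Kt(t, x):
    """1-D Gaussian heat-kernel Gram matrix, Theorem 4(5) with m = 1."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return (4.0 * np.pi * t) ** -0.5 * np.exp(-d2 / (4.0 * t))

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 20)
y = 1.0 + 0.005 * rng.standard_normal(20)   # constant target plus noise

t, gamma, l = 0.05, 1.0, len(x)
K, Kp, Kpp = Kt(t, x), Kt(2 * t, x), Kt(3 * t, x)   # K' = K_2t, K'' = K_3t
M = K - 2 * Kp + Kpp              # Gram matrix of the penalty in Eq. (9)
Psi = np.ones((l, 1))             # H_0 spanned by constants (Section 4.2)

# Linear system of Eq. (11):
#   (gamma*l*M + K^2) alpha + K Psi beta = K y,
#   Psi^T (y - K alpha - Psi beta) = 0
A = np.block([[gamma * l * M + K @ K, K @ Psi],
              [Psi.T @ K, Psi.T @ Psi]])
b = np.concatenate([K @ y, Psi.T @ y])
sol = np.linalg.solve(A, b)
alpha, beta = sol[:l], sol[l]
f_hat = K @ alpha + beta          # Eq. (8) with mu_1(x) = 1
print(float(beta), float(np.abs(f_hat - 1.0).max()))
```

Because the constant component is carried by β and left unpenalized, the fit stays close to the true constant even for a large γ, mirroring the stability shown in the right diagram of Fig. 1.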
We use the Gaussian RBF kernels in\nthe \ufb01rst three datasets, and use polynomial kernels to provide a counter example on the last dataset.\nWithout any prior knowledge about the data distribution, we assume that the examples are uniformly\ndistributed, and so constant functions are considered to be in H0 for the Gaussian RBF kernel, but\nlinear functions are not considered to be in H0 since it is rare for data to be distributed uniformly on\na large ball. The data and results for the toy dataset are illustrated in the left diagram and the right\ndiagram in Fig. 1.\n\n5.1 UCI Dataset Isolet about Spoken Letter Recognition\n\nWe follow the same semi-supervised settings as that in [1] to compare RLS with PRLS, and compare\nLapRLS with PLapRLS on the Isolet database. The dataset contains utterances of 150 subjects who\n\n1Note that a subset of Rn is compact if and only if it is closed and bounded. Since Rn is not bounded, it\nis not compact, and so the Representer Theorem cannot be established. 
This is the reason why we cannot talk about R^n directly.

[Figure 2 shows four panels of error-rate curves versus the number of labeled speakers: RLS vs PRLS and LapRLS vs PLapRLS, each evaluated on the unlabeled set and on the test set.]

Figure 2: Isolet Experiment

pronounced the name of each letter of the English alphabet twice. The speakers were grouped into 5 sets of 30 speakers each. The data of the first 30 speakers forms a training set of 1,560 examples, and that of the last 29 speakers forms the test set. The task is to distinguish the first 13 letters from the last 13. To simulate a real-world situation, we formed 30 binary classification problems corresponding to 30 splits of the training data, where all 52 utterances of one speaker were labeled and all the rest were left unlabeled. All the algorithms use Gaussian RBF kernels. For RLS and LapRLS, the results were obtained with width σ = 10, γl = 0.05, and γ_A l = γ_I l/(u + l)² = 0.005. For PRLS and PLapRLS, the results were obtained with width σ = 4, γl = 0.01, and γ_A l = γ_I l/(u + l)² = 0.01. In Fig. 
2, we can see that both PRLS and PLapRLS achieve significant performance improvements over their corresponding counterparts on both the unlabeled data and the test set.

5.2 UCI Dataset Letter about Printed Letter Recognition

In Dataset Letter, there are 16 features for each example, and there are 26 classes representing the upper-case printed letters. The first 400 examples were taken to form the training set. The remaining 19,600 examples form the test set. The parameters are set as follows: σ = 1, γl = γ_A(l + u) = 0.25, and γ_I l/(u + l)² = 0.05. For each of the four algorithms RLS, PRLS, LapRLS, and PLapRLS, for each of the 26 one-versus-all binary classification tasks, and for each of 10 runs, two examples for each class were randomly labeled. For each algorithm, the averages over all the 260 one-versus-all binary classification error rates, for the 398 unlabeled examples and for the test set respectively, are: (5.79%, 5.23%) for RLS, (5.12%, 4.77%) for PRLS, (0%, 2.96%) for LapRLS, and (0%, 3.15%) for PLapRLS. From the results, we can see that RLS is improved on both the unlabeled examples and the test set. The fact that there is no error in the total 260 tasks for LapRLS and PLapRLS on the unlabeled examples suggests that the data are distributed on a curved manifold. On a curved manifold, the heat kernels do not take the Gaussian RBF form, and so PLapRLS using the Gaussian RBF form cannot achieve its best performance. This is the reason why we observe that PLapRLS is slightly worse than LapRLS on the test set. It suggests the need for extensive further investigation of heat kernels on manifolds.

5.3 A Counter Example in Handwritten Digit Recognition

Note that polynomial kernels with degree 3 were used on the USPS dataset in [1], and 2 images for each class were randomly labeled. We follow the same experimental setting as that in [1]. For RLS, if we use Eq. 
(2), then the averages of the 45 pairwise binary classification error rates are 8.83% and 8.41% for the 398 unlabeled images and the 8,898 images in the test set, respectively. If constant functions are not penalized, then we should use f*(x) = Σ_{i=1}^l α_i K(x_i, x) + a, and the corresponding error rates are 9.75% and 9.09%, respectively. By this example, we show that leaving constant functions outside the regularization term is dangerous; however, it is fortunate that we have a theory to guide this in Section 4: if X is compact and ∫_X K(x, α) dP(α) in Eq. (4) is a constant, then constant functions should not be penalized.

6 Conclusion

A novel learning scheme is proposed based on a new viewpoint of penalizing the inconsistent part between inductive functions and kernels. In theoretical aspects, we have three important claims: (1) On a compact domain or manifold, if the denominator in Eq. (4) is a constant, then there is a new Representer Theorem; (2) The same conditions become a sufficient condition under which constant functions should not be penalized; and (3) Under the same conditions, a function belongs to the null space if and only if the function should not be penalized. Empirically, we claim that the novel learning scheme can achieve accuracy improvements in practical applications.

Acknowledgments

The work described in this paper was supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4150/07E and Project No. CUHK4235/04E). The first author would like to thank Hao Ma for his helpful suggestions, thank Kun Zhang and Wenye Li for useful discussions, and thank Alberto Paccanaro for his support.

References

[1] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.

[2] F. 
Cucker and S. Smale. On the mathematical foundations of learning. Bulletin (New Series) of the American Mathematical Society, 39(1):1-49, 2002.

[3] Lokenath Debnath and Piotr Mikusinski. Introduction to Hilbert Spaces with Applications. Academic Press, San Diego, second edition, 1999.

[4] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.

[5] T. Hastie and C. Loader. Local regression: Automatic kernel carpentry. Statistical Science, 8(1):120-129, 1993.

[6] John Lafferty and Guy Lebanon. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6:129-163, 2005.

[7] Wenye Li, Kin-Hong Lee, and Kwong-Sak Leung. Generalized regularized least-squares learning with predefined features in a Hilbert space. In NIPS, 2006.

[8] E. A. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9(1):141-142, 1964.

[9] R. M. Rifkin and R. A. Lippert. Notes on regularized least-squares. Technical Report 2007-019, Massachusetts Institute of Technology, 2007.

[10] S. Rosenberg. The Laplacian on a Riemannian Manifold. Cambridge University Press, 1997.

[11] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT, 2001.

[12] I. Schoenberg. Spline functions and the problem of graduation. Proc. Nat. Acad. Sci. USA, 52:947-950, 1964.

[13] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, 1977.

[14] G. S. Watson. Smooth regression analysis. Sankhyā, Series A, 26:359-372, 1964.
", "award": [], "sourceid": 644, "authors": [{"given_name": "Haixuan", "family_name": "Yang", "institution": null}, {"given_name": "Irwin", "family_name": "King", "institution": null}, {"given_name": "Michael", "family_name": "Lyu", "institution": null}]}