{"title": "Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space", "book": "Advances in Neural Information Processing Systems", "page_first": 881, "page_last": 888, "abstract": null, "full_text": "Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space\nWenye Li, Kin-Hong Lee, Kwong-Sak Leung\nDepartment of Computer Science and Engineering\nThe Chinese University of Hong Kong\nShatin, Hong Kong, China\n{wyli, khlee, ksleung}@cse.cuhk.edu.hk\n\nAbstract\nKernel-based regularized learning seeks a model in a hypothesis space by minimizing the empirical error and the model's complexity. By the representer theorem, the solution is a linear combination of translates of a kernel. This paper investigates a generalized form of the representer theorem for kernel-based learning. After mapping predefined features and translates of a kernel simultaneously onto a hypothesis space through a specific kernel construction, we propose a new algorithm that uses a generalized regularizer which leaves part of the space unregularized. With a squared-loss function for the empirical error, a simple convex solution is obtained which combines predefined features with translates of the kernel. Empirical evaluations confirm the effectiveness of the algorithm for supervised learning tasks.\n\n1 Introduction\nSupervised learning, or learning from examples, refers to the task of training a system on a set of examples specified as input-output pairs. After training, the system is used to predict the output value for any valid input object. Examples of such tasks include regression, which produces continuous values, and classification, which predicts a class label for an input object. 
Vapnik's seminal work[1] shows that the key to solving this problem effectively is controlling the solution's complexity, which leads to the techniques known as regularized kernel methods[1][2][3] and regularization networks[4]. The work championed by Poggio and other researchers[5][6] implicitly treats learning as an approximation problem and gives a general scheme with ideas going back to modern regularization theory[7][8][9]. In both frameworks, a solution is sought by simultaneously minimizing the empirical error and the complexity. More precisely, given a training set $D = (\\mathbf{x}_i; y_i)_{i=1}^{m}$, an estimator $f: X \\to Y$, where $X$ is a closed subset of $\\mathbb{R}^d$ and $Y \\subset \\mathbb{R}$, is given by\n\n$$\\min_{f \\in H_K} \\frac{1}{m} \\sum_{i=1}^{m} V(y_i, f(\\mathbf{x}_i)) + \\gamma \\|f\\|_K^2 \\qquad (1)$$\n\nwhere $V$ is a convex loss function, $\\|f\\|_K$ is the norm of $f$ in a reproducing kernel Hilbert space (RKHS) $H_K$ induced by a positive definite function (a kernel) $K_{\\mathbf{x}}(\\mathbf{x}') = K(\\mathbf{x}, \\mathbf{x}')$, and $\\gamma$ is a regularization parameter that makes a trade-off between the empirical error and the complexity. $\\gamma \\|f\\|_K^2$ is also called a regularizer.\n\nAccording to the representer theorem[10][11][12], the minimizer of (1) admits a simple solution as a linear combination of translates of the kernel $K$ by the training data,\n\n$$f = \\sum_{i=1}^{m} c_i K_{\\mathbf{x}_i}, \\quad c_i \\in \\mathbb{R}, \\quad 1 \\le i \\le m,$$\n\nfor a variety of loss functions. Different loss functions lead to different learning algorithms. For example, when used for classification, a squared loss $(y - f(\\mathbf{x}))^2$ brings about the regularized least-squares classification (RLSC) algorithm[13][14][15], while a hinge loss $(1 - y f(\\mathbf{x}))_+ \\equiv \\max(1 - y f(\\mathbf{x}), 0)$ corresponds to the classical support vector machine (SVM). Using this model, data are implicitly projected onto the hypothesis space $H_K$ via a transformation $\\mathbf{x} \\mapsto K_{\\mathbf{x}}$, and a linear functional is sought by finding its representer in $H_K$, which generally has infinite dimensions. It is generally believed that learning problems associated with infinite dimensions are ill-posed and need regularization. 
However, finite dimensional problems are often associated with well-posedness and do not need regularization. Motivated by this, we unify these two views in this paper. Using an existing trick in designing kernels, an RKHS is constructed which contains a subspace spanned by some predefined features, and this subspace is left unregularized during the learning process. Empirical results show that embedding these features often stabilizes the algorithm's performance across different choices of kernels and prevents the results from deteriorating for inappropriate kernels.\n\nThe paper is organized as follows. First, a generalized regularized learning model and its associated representer theorem are studied. Then, we introduce an existing trick with which we construct a hypothesis space that has a subspace of the predefined features. Next, a generic learning algorithm is proposed based on the model and evaluated in particular for classification problems. Empirical results confirm the benefits brought by the algorithm.\n\nA note on notation. Throughout the paper, vectors and matrices are represented in bold notation and scalars in normal script, e.g. $\\mathbf{x}_1, \\dots, \\mathbf{x}_m \\in \\mathbb{R}^d$, $\\mathbf{K} \\in \\mathbb{R}^{m \\times m}$, and $y_1, \\dots, y_m \\in \\mathbb{R}$. $\\mathbf{I}$ and $\\mathbf{O}$ are used to denote an identity matrix and a zero matrix of appropriate sizes, respectively. For clarity, the size of a matrix is sometimes added as a subscript, such as $\\mathbf{O}_{m \\times \\ell}$.\n\n2 Generalized regularized least-squares learning model\nSuppose the space $H_K$ decomposes into the direct sum $H_K = H_0 \\oplus H_1$, where $H_0$ is spanned by $\\ell$ ($\\le m$) linearly independent features: $H_0 = \\mathrm{span}(\\varphi_1, \\dots, \\varphi_\\ell)$. We propose the generalized regularized least-squares (G-RLS) learning model as\n\n$$\\min_{f \\in H_K} L(f) = \\frac{1}{m} \\sum_{i=1}^{m} (y_i - f(\\mathbf{x}_i))^2 + \\gamma \\|f - Pf\\|_K^2, \\qquad (2)$$\n\nwhere $Pf$ is the orthogonal projection of $f$ onto $H_0$.\n\nSuppose $f^*$ is the minimizer of (2). For any $f \\in H_K$, let $f = f^* + \\epsilon g$, where $\\epsilon \\in \\mathbb{R}$ and $g \\in H_K$. Now take the derivative w.r.t. $\\epsilon$ and notice that $\\partial L / \\partial \\epsilon \\, |_{\\epsilon = 0} = 0$. 
Then\n\n$$-\\frac{2}{m} \\sum_{i=1}^{m} (y_i - f^*(\\mathbf{x}_i)) g(\\mathbf{x}_i) + 2 \\gamma \\langle f^* - Pf^*, g \\rangle_K = 0, \\qquad (3)$$\n\nwhere $\\langle \\cdot, \\cdot \\rangle_K$ denotes the inner product in $H_K$. This equation holds for any $g \\in H_K$. In particular, setting $g = K_{\\mathbf{x}}$ gives\n\n$$f^* - Pf^* = \\frac{\\sum_{i=1}^{m} (y_i - f^*(\\mathbf{x}_i)) K_{\\mathbf{x}_i}}{\\gamma m}. \\qquad (4)$$\n\n$Pf^*$ is the orthogonal projection of $f^*$ onto $H_0$ and hence\n\n$$Pf^* = \\sum_{p=1}^{\\ell} \\lambda_p \\varphi_p, \\quad \\lambda_p \\in \\mathbb{R}, \\quad 1 \\le p \\le \\ell. \\qquad (5)$$\n\nSo (4) is simplified to\n\n$$f^* = \\sum_{p=1}^{\\ell} \\lambda_p \\varphi_p + \\sum_{i=1}^{m} c_i K_{\\mathbf{x}_i}, \\qquad (6)$$\n\nwhere\n\n$$c_i = \\frac{y_i - f^*(\\mathbf{x}_i)}{\\gamma m}, \\quad 1 \\le i \\le m. \\qquad (7)$$\n\nThe coefficients $\\lambda_1, \\dots, \\lambda_\\ell, c_1, \\dots, c_m$ are uniquely specified by $m + \\ell$ linear equations. The first $m$ equations are obtained by substituting (6) into (7). The remaining $\\ell$ equations are derived from the orthogonality constraint between $Pf^*$ and $f^* - Pf^*$, which can be written as\n\n$$\\left\\langle \\varphi_p, \\sum_{i=1}^{m} c_i K_{\\mathbf{x}_i} \\right\\rangle_K = 0, \\quad 1 \\le p \\le \\ell, \\qquad (8)$$\n\nor equivalently, due to the property of reproducing kernels,\n\n$$\\sum_{i=1}^{m} c_i \\varphi_p(\\mathbf{x}_i) = 0, \\quad 1 \\le p \\le \\ell. \\qquad (9)$$\n\nThe solution (6) derived from (2) satisfies the reproduction property. Suppose $(\\mathbf{x}_i; y_i)_{i=1}^{m}$ comes purely from a model which is perfectly linearly related to $\\varphi_1, \\dots, \\varphi_\\ell$; it is then desirable to get back a solution that is independent of the other features. As an evident result of (2), the property is satisfied: the parameters $c_1, \\dots, c_m$ in the resulting estimator (6) are all zero, which makes the regularizer in (2) equal to zero.\n\n3 Kernel construction\nBy decomposing a hypothesis space $H_K$ and studying a generalized regularizer, we have proposed the G-RLS model and derived a solution which consists of predefined features as well as translates of a kernel function. In this section, starting with predefined features $\\varphi_1, \\dots, \\varphi_\\ell$ and a kernel $\\Phi$, we will construct a hypothesis space which contains the features and translates of the kernel by using an existing trick. 
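The linear system of Section 2, equations (6), (7) and (9), can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the construction used in the paper: a plain Gaussian kernel stands in for K, the data are one-dimensional, and the names gaussian_kernel, solve_grls_system, gamma and sigma are hypothetical. Substituting (6) into (7) gives (K + gamma*m*I) c + F^T lam = y, and (9) gives F c = 0, where F holds the feature values at the training points.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian kernel matrix between 1-D sample arrays a and b
    # (an assumed stand-in kernel, chosen only for illustration).
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / sigma ** 2)

def solve_grls_system(x, y, features, gamma=0.1, sigma=1.0):
    # Solve the m + l linear equations
    #   (K + gamma*m*I) c + F^T lam = y   (from (6) and (7))
    #   F c = 0                           (from (9))
    m = x.shape[0]
    F = np.vstack([phi(x) for phi in features])  # shape (l, m)
    l = F.shape[0]
    K = gaussian_kernel(x, x, sigma)
    A = np.block([[K + gamma * m * np.eye(m), F.T],
                  [F, np.zeros((l, l))]])
    rhs = np.concatenate([y, np.zeros(l)])
    sol = np.linalg.solve(A, rhs)
    return sol[:m], sol[m:]  # kernel coefficients c, feature coefficients lam

# Data generated exactly by the features 1, x, x^2 (no noise).
x = np.linspace(-2.0, 2.0, 15)
features = [lambda t: np.ones_like(t), lambda t: t, lambda t: t ** 2]
y = 2.0 + 3.0 * x - x ** 2
c, lam = solve_grls_system(x, y, features)
```

Because the targets here lie exactly in the span of the features, the solve illustrates the reproduction property described above: the kernel coefficients c vanish up to rounding and lam recovers the generating coefficients.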
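3.1 A kernel construction trick\nLet's consider the following reproducing kernel\n\n$$K(\\mathbf{x}, \\mathbf{x}') = H(\\mathbf{x}, \\mathbf{x}') + \\sum_{p=1}^{\\ell} \\psi_p(\\mathbf{x}) \\psi_p(\\mathbf{x}') \\qquad (10)$$\n\nwhere\n\n$$H(\\mathbf{x}, \\mathbf{x}') = \\Phi(\\mathbf{x}, \\mathbf{x}') - \\sum_{p=1}^{\\ell} \\psi_p(\\mathbf{x}) \\Phi(\\mathbf{x}_p, \\mathbf{x}') - \\sum_{q=1}^{\\ell} \\psi_q(\\mathbf{x}') \\Phi(\\mathbf{x}, \\mathbf{x}_q) + \\sum_{p=1}^{\\ell} \\sum_{q=1}^{\\ell} \\psi_p(\\mathbf{x}) \\psi_q(\\mathbf{x}') \\Phi(\\mathbf{x}_p, \\mathbf{x}_q), \\qquad (11)$$\n\n$\\Phi$ is any strictly positive definite function, and $\\psi_1, \\dots, \\psi_\\ell$ defines a linear transformation of $\\varphi_1, \\dots, \\varphi_\\ell$ w.r.t. $\\mathbf{x}_1, \\dots, \\mathbf{x}_\\ell$,\n\n$$\\begin{pmatrix} \\psi_1(\\mathbf{x}) \\\\\\\\ \\vdots \\\\\\\\ \\psi_\\ell(\\mathbf{x}) \\end{pmatrix} = \\begin{pmatrix} \\varphi_1(\\mathbf{x}_1) & \\cdots & \\varphi_1(\\mathbf{x}_\\ell) \\\\\\\\ \\vdots & \\ddots & \\vdots \\\\\\\\ \\varphi_\\ell(\\mathbf{x}_1) & \\cdots & \\varphi_\\ell(\\mathbf{x}_\\ell) \\end{pmatrix}^{-1} \\begin{pmatrix} \\varphi_1(\\mathbf{x}) \\\\\\\\ \\vdots \\\\\\\\ \\varphi_\\ell(\\mathbf{x}) \\end{pmatrix}, \\qquad (12)$$\n\nwhich satisfies\n\n$$\\psi_q(\\mathbf{x}_p) = \\begin{cases} 1 & 1 \\le p = q \\le \\ell \\\\\\\\ 0 & p \\ne q \\end{cases}. \\qquad (13)$$\n\nThis trick is studied in [16] to provide an alternative basis for radial basis functions and was first used in a fast RBF interpolation algorithm[17]. A sketch of properties which are peripheral to our concerns in this paper is given below.\n\n$$K_{\\mathbf{x}_p} = \\psi_p, \\quad 1 \\le p \\le \\ell \\qquad (14)$$\n\n$$\\langle \\psi_p, \\psi_q \\rangle_K = \\begin{cases} 1 & 1 \\le p = q \\le \\ell \\\\\\\\ 0 & p \\ne q \\end{cases} \\qquad (15)$$\n\n$$H_{\\mathbf{x}_p} = H(\\mathbf{x}_p, \\cdot) = 0, \\quad 1 \\le p \\le \\ell \\qquad (16)$$\n\n$$\\langle H_{\\mathbf{x}_i}, \\psi_p \\rangle_K = \\langle H(\\mathbf{x}_i, \\cdot), \\psi_p \\rangle_K = 0, \\quad \\ell + 1 \\le i \\le m, \\quad 1 \\le p \\le \\ell \\qquad (17)$$\n\n$$\\langle H_{\\mathbf{x}_i}, H_{\\mathbf{x}_j} \\rangle_K = H(\\mathbf{x}_i, \\mathbf{x}_j), \\quad \\ell + 1 \\le i, j \\le m \\qquad (18)$$\n\nAnother property is that the matrix $\\mathbf{H} = (H(\\mathbf{x}_i, \\mathbf{x}_j))_{i,j=\\ell+1}^{m}$ is strictly positive definite, which will be used in the computations below.\n\nBy constructing a kernel $K$ using this trick, predefined features $\\varphi_1, \\dots, \\varphi_\\ell$ are explicitly mapped onto $H_K$, which has a subspace $H_0 = \\mathrm{span}(\\psi_1, \\dots, \\psi_\\ell) = \\mathrm{span}(\\varphi_1, \\dots, \\varphi_\\ell)$. By property (15), we can see that $\\psi_1, \\dots, \\psi_\\ell$ also forms an orthonormal basis of $H_0$.\n\n3.2 Computation\nAfter projecting the features $\\varphi_1, \\dots, \\varphi_\\ell$ onto an RKHS $H_K$, let's study the regularized minimization problem in (2). As shown in (6), the minimizer has the form of a linear combination of predefined features and translates of a kernel. 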
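By the properties of $K$ in (14)-(17), and with the coefficients $\\lambda_p$ taken with respect to the basis $\\psi_1, \\dots, \\psi_\\ell$ of $H_0$, the minimizer can be rewritten as:\n\n$$f = \\sum_{p=1}^{\\ell} \\lambda_p \\psi_p + \\sum_{i=1}^{m} c_i K_{\\mathbf{x}_i}$$\n$$= \\sum_{p=1}^{\\ell} \\lambda_p \\psi_p + \\sum_{i=1}^{\\ell} c_i \\psi_i + \\sum_{i=\\ell+1}^{m} c_i \\left( H_{\\mathbf{x}_i} + \\sum_{p=1}^{\\ell} \\psi_p(\\mathbf{x}_i) \\psi_p \\right)$$\n$$= \\sum_{p=1}^{\\ell} \\left( \\lambda_p + c_p + \\sum_{i=\\ell+1}^{m} c_i \\psi_p(\\mathbf{x}_i) \\right) \\psi_p + \\sum_{i=\\ell+1}^{m} c_i H_{\\mathbf{x}_i}$$\n$$= \\sum_{p=1}^{\\ell} \\tilde{\\lambda}_p \\psi_p + \\sum_{i=\\ell+1}^{m} \\tilde{c}_i H_{\\mathbf{x}_i} \\qquad (19)$$\n\nwhere $\\tilde{\\lambda}_1, \\dots, \\tilde{\\lambda}_\\ell, \\tilde{c}_{\\ell+1}, \\dots, \\tilde{c}_m$ are $m$ parameters to be determined. Furthermore, from the orthogonal property between $\\psi_p$ and $H_{\\mathbf{x}_i}$ in (17), we have\n\n$$f - Pf = \\sum_{i=\\ell+1}^{m} \\tilde{c}_i H_{\\mathbf{x}_i}. \\qquad (20)$$\n\nTo determine the values of $\\tilde{\\boldsymbol{\\lambda}} = (\\tilde{\\lambda}_1, \\dots, \\tilde{\\lambda}_\\ell)^T$ and $\\tilde{\\mathbf{c}} = (\\tilde{c}_{\\ell+1}, \\dots, \\tilde{c}_m)^T$, we need\n\n$$\\|f - Pf\\|_K^2 = \\sum_{i,j=\\ell+1}^{m} \\tilde{c}_i \\tilde{c}_j H(\\mathbf{x}_i, \\mathbf{x}_j) = \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix}^T \\tilde{\\mathbf{H}} \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix}, \\qquad (21)$$\n\nwhere\n\n$$\\tilde{\\mathbf{H}} = \\begin{pmatrix} \\mathbf{O}_{\\ell \\times \\ell} & \\mathbf{O}_{\\ell \\times (m-\\ell)} \\\\\\\\ \\mathbf{O}_{(m-\\ell) \\times \\ell} & \\mathbf{H} \\end{pmatrix}.$$\n\nSubstituting (21) into (2), we have\n\n$$L = \\frac{1}{m} \\left( \\mathbf{y} - \\tilde{\\mathbf{K}} \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix} \\right)^T \\left( \\mathbf{y} - \\tilde{\\mathbf{K}} \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix} \\right) + \\gamma \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix}^T \\tilde{\\mathbf{H}} \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix}, \\qquad (22)$$\n\nwhere\n\n$$\\tilde{\\mathbf{K}} = \\begin{pmatrix} \\mathbf{I}_{\\ell \\times \\ell} & \\mathbf{O}_{\\ell \\times (m-\\ell)} \\\\\\\\ \\mathbf{E}^T & \\mathbf{H} \\end{pmatrix}$$\n\nand $\\mathbf{E} = (\\psi_p(\\mathbf{x}_i))_{p=1, i=\\ell+1}^{\\ell, m}$. Take the derivative w.r.t. $(\\tilde{\\boldsymbol{\\lambda}}^T, \\tilde{\\mathbf{c}}^T)^T$ and set the derivative to zero, and we get\n\n$$\\left( \\tilde{\\mathbf{K}}^2 + \\gamma m \\tilde{\\mathbf{H}} \\right) \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix} = \\tilde{\\mathbf{K}} \\mathbf{y}. \\qquad (23)$$\n\nSince\n\n$$\\tilde{\\mathbf{K}}^{-1} = \\begin{pmatrix} \\mathbf{I}_{\\ell \\times \\ell} & \\mathbf{O}_{\\ell \\times (m-\\ell)} \\\\\\\\ -\\mathbf{H}^{-1} \\mathbf{E}^T & \\mathbf{H}^{-1} \\end{pmatrix} \\quad \\text{and} \\quad \\tilde{\\mathbf{K}}^{-1} \\tilde{\\mathbf{H}} = \\tilde{\\mathbf{I}} = \\begin{pmatrix} \\mathbf{O}_{\\ell \\times \\ell} & \\mathbf{O}_{\\ell \\times (m-\\ell)} \\\\\\\\ \\mathbf{O}_{(m-\\ell) \\times \\ell} & \\mathbf{I}_{(m-\\ell) \\times (m-\\ell)} \\end{pmatrix},$$\n\nwe have\n\n$$\\left( \\tilde{\\mathbf{K}} + \\gamma m \\tilde{\\mathbf{I}} \\right) \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix} = \\mathbf{y}, \\qquad (24)$$\n\ni.e.\n\n$$\\begin{pmatrix} \\mathbf{I}_{\\ell \\times \\ell} & \\mathbf{O}_{\\ell \\times (m-\\ell)} \\\\\\\\ \\mathbf{E}^T & \\mathbf{H} + \\gamma m \\mathbf{I} \\end{pmatrix} \\begin{pmatrix} \\tilde{\\boldsymbol{\\lambda}} \\\\\\\\ \\tilde{\\mathbf{c}} \\end{pmatrix} = \\begin{pmatrix} \\mathbf{y}_1 \\\\\\\\ \\mathbf{y}_2 \\end{pmatrix}, \\qquad (25)$$\n\nwhere $\\mathbf{y}_1 = (y_1, \\dots, y_\\ell)^T$ and $\\mathbf{y}_2 = (y_{\\ell+1}, \\dots, y_m)^T$. Equation (25) uniquely specifies $\\tilde{\\boldsymbol{\\lambda}}$ by\n\n$$\\tilde{\\boldsymbol{\\lambda}} = \\mathbf{y}_1, \\qquad (26)$$\n\nand $\\tilde{\\mathbf{c}}$ by\n\n$$(\\mathbf{H} + \\gamma m \\mathbf{I}) \\tilde{\\mathbf{c}} = \\mathbf{y}_2 - \\mathbf{E}^T \\tilde{\\boldsymbol{\\lambda}}. \\qquad (27)$$\n\n$\\mathbf{H} + \\gamma m \\mathbf{I}$ is a strictly positive definite matrix. The equation can be efficiently solved either by conjugate gradient or by Cholesky factorization. The worst case complexity is $O((m - \\ell)^3) \\approx O(m^3)$. It is also possible to investigate iterative methods for solving linear systems coupled with recent advances in fast matrix-vector multiplication methods (e.g. 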
fast multipole method), and the complexity reduces to nearly $O(m \\log m)$, which provides the potential to solve large scale problems.\n\n4 A generic learning algorithm\nBased on the discussions above, a generic learning algorithm (G-RLS algorithm) is summarized below.\n1. Start with data $(\\mathbf{x}_i; y_i)_{i=1}^{m}$.\n2. For $\\ell$ ($\\le m$) predefined linearly independent features $\\varphi_1, \\dots, \\varphi_\\ell$ of the data, define $\\psi_1, \\dots, \\psi_\\ell$ according to equation (12).\n3. Choose a symmetric, strictly positive definite function $\\Phi_{\\mathbf{x}}(\\mathbf{x}') = \\Phi(\\mathbf{x}, \\mathbf{x}')$ which is continuous on $X \\times X$. Define $H$ according to equation (11).\n4. The estimator $f: X \\to Y$ is given by\n\n$$f(\\mathbf{x}) = \\sum_{p=1}^{\\ell} \\tilde{\\lambda}_p \\psi_p(\\mathbf{x}) + \\sum_{i=\\ell+1}^{m} \\tilde{c}_i H_{\\mathbf{x}_i}(\\mathbf{x}) \\qquad (28)$$\n\nwhere $\\tilde{\\lambda}_1, \\dots, \\tilde{\\lambda}_\\ell, \\tilde{c}_{\\ell+1}, \\dots, \\tilde{c}_m$ are obtained by solving equations (26) and (27).\n\nThe algorithm can be applied to a number of applications including regression and binary classification. As a simple example for regression, noisy points were randomly generated via a function y = |5 - x|, and we fitted the data by a curve. Polynomial features up to the second degree ($\\varphi_1 = 1$, $\\varphi_2 = x$, $\\varphi_3 = x^2$) were used for the G-RLS algorithm along with a Gaussian RBF kernel $\\Phi_{x}(\\cdot) = e^{-\\|x - \\cdot\\|^2 / \\sigma^2}$. We selected ridge regression with the Gaussian RBF kernel for comparison, which can be regarded as an implementation of the standard regularized least-squares model for regression tasks. For both algorithms, three trials were made in which the parameter $\\sigma$ was set to a large value, to a small value, and by cross validation, respectively. For each $\\sigma$, the parameter $\\gamma$ was set by cross validation. Compared with ridge regression in figure 1(b), the existence of polynomial features in G-RLS has the effect of stabilizing the results, as shown in figure 1(a). Varying $\\sigma$, different fitting results were obtained by ridge regression. However, for the G-RLS algorithm, the difference was not evident.\n\nIn the case of generalized regularized least-squares classification (G-RLSC), each $y_i$ of the training set takes a value in {-1, 1}. The predicted label of any $\\mathbf{x}$ depends on the sign of (28):\n\n$$y = \\begin{cases} 1 & f(\\mathbf{x}) > 0 \\\\\\\\ -1 & \\text{otherwise} \\end{cases}.$$\n\nG-RLSC uses the \"classical\" squared-loss as a classification loss criterion. The effectiveness of this criterion has been reported by the empirical results[13][14][15].\n\nFigure 1: A Regression Example, with panels (a) G-RLS Regression and (b) Ridge Regression; each panel shows the data and the fitted curves for $\\sigma^2$ set by cross validation, $\\sigma^2 = 1000$, and $\\sigma^2 = 0.001$. The existence of polynomial features in G-RLS helped to improve the stability of the algorithm.\n\n5 Experiments\nTo evaluate the performance of the G-RLS algorithm, empirical results are reported on text categorization tasks using the three datasets from the CMU text mining group¹. The 7-sectors dataset has 4,573 web pages belonging to seven economic sectors, with each sector containing between 300 and 1,099 pages. The 4-universities dataset consists of 8,282 web pages collected mainly from four universities, in which the pages belong to seven classes and each class has 137 to 3,764 pages. The 20-newsgroups dataset collects UseNet postings into twenty newsgroups and each group has about 1,000 messages. We experimented with its four major subsets. The first subset has 5 groups (comp.*), the second 4 groups (rec.*), the third 4 groups (sci.*) and the last 4 groups (talk.*). For each dataset, we removed all but the 2,000 words with the highest mutual information with the class variable, using the rainbow package[18]. Each document was represented as a bag of words with linear normalization into [-1, 1]. Probabilistic latent semantic analysis[19] (pLSA) was used to get ten latent features $\\varphi_1, \\dots, \\varphi_{10}$ out of the data. Experiments were carried out with different numbers (100 to 3,200) of data for training and the rest for testing. 
Each experiment consisted of ten runs and the average accuracy is reported. In each run, the data were separated by the xval-prep utility that accompanies the C4.5 package². Figure 2 compares the performance of G-RLSC, RLSC and SVM. G-RLSC reports improved results on most of the datasets except on 4-universities. Moreover, a closer look shows that although SVM excels on this dataset as the number of training data increases, G-RLSC still performs better than standard RLSC. A possible reason is that the hinge loss used by SVM is more appropriate than the squared-loss used by RLSC and G-RLSC on this dataset, while the embedding of pLSA features still improves the accuracy.\n\nFigure 2: Classification accuracies on CMU text datasets with different numbers of training samples. Ten pLSA features along with a linear kernel $\\Phi$ were used for G-RLSC. Both bag-of-words (BoW) and pLSA representations of documents were tried for RLSC and SVM with a linear kernel. The parameter $\\gamma$ was selected via cross validation. For multi-class classification, G-RLSC and RLSC used a one-versus-all strategy. SVM used a one-versus-one strategy.\n\n¹ http://www.cs.cmu.edu/~TextLearning/datasets.html\n² http://www.rulequest.com/Personal/c4.5r8.tar.gz\n\n6 Conclusion\nIn this paper, we first proposed a generic G-RLS learning model. Unlike the standard kernel-based methods, which only consider the translates of a kernel for model learning, the new model takes predefined features into special consideration. A generalized regularizer is studied which leaves part of the hypothesis space unregularized. Similar ideas were explored in spline smoothing[9], in which low degree polynomials are not regularized. Another example is semi-parametric SVM[2], which considers the addition of some features to the kernel expansion for SVM. However, to our knowledge, few learning algorithms and applications have been studied along this line from a unified RKHS regularization point of view, or investigated in empirical evaluations. The second part of our work presented a practical computation method based on the model. An RKHS that contains the combined solutions is explicitly constructed based on a special trick in designing kernels. (The idea of a conditionally positive definite function[20] is lurking in the background of this trick, which goes beyond the discussion of this paper.) With the construction of the RKHS, the computation is further optimized and the theoretical analysis of such algorithms is also potentially facilitated. We evaluated the G-RLS learning algorithm in text categorization. The empirical results from real-world applications have confirmed the effectiveness of the algorithm.\n\nAcknowledgments\nThe authors thank Dr. Haixuan Yang for useful discussions. This research was partially supported by RGC Earmarked Grant #4173/04E and #4132/05E of Hong Kong SAR and RGC Research Grant Direct Allocation of the Chinese University of Hong Kong.\n\nReferences\n[1] V.N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.\n[2] B. Schölkopf and A.J. Smola. Learning with Kernels. The MIT Press, 2002.\n[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n[4] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Adv. Comput. Math., 13:1–50, 2000.\n[5] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978–982, 1990.\n[6] T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Not. Am. Math. Soc., 50:537–544, 2003.\n[7] A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-Posed Problems. Winston and Sons, 1977.\n[8] V.A. Morozov. Methods for Solving Incorrectly Posed Problems. Springer-Verlag, 1984.\n[9] G. Wahba. Spline Models for Observational Data. SIAM, 1990.\n[10] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. 
Appl., 33:82–95, 1971.\n[11] F. Girosi, M.J. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Comput., 7:219–269, 1995.\n[12] B. Schölkopf, R. Herbrich, and A.J. Smola. A generalized representer theorem. In COLT'2001 and EuroCOLT'2001, 2001.\n[13] R.M. Rifkin. Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.\n[14] G. Fung and O.L. Mangasarian. Proximal support vector machine classifiers. In KDD'01, 2001.\n[15] J.A.K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Process. Lett., 9:293–300, 1999.\n[16] W. Light and H. Wayne. Spaces of distributions, interpolation by translates of a basis function and error estimates. J. Numer. Math., 81:415–450, 1999.\n[17] R.K. Beatson, W.A. Light, and S. Billings. Fast solution of the radial basis function interpolation equations: Domain decomposition methods. SIAM J. Sci. Comput., 22:1717–1740, 2000.\n[18] A.K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.\n[19] T. Hofmann. Probabilistic latent semantic analysis. In UAI'99, 1999.\n[20] C.A. Micchelli. Interpolation of scattered data: Distances, matrices, and conditionally positive definite functions. Constr. Approx., 2:11–22, 1986.\n", "award": [], "sourceid": 3014, "authors": [{"given_name": "Wenye", "family_name": "Li", "institution": null}, {"given_name": "Kin-hong", "family_name": "Lee", "institution": null}, {"given_name": "Kwong-sak", "family_name": "Leung", "institution": null}]}