{"title": "Kernels for Multi-task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 928, "abstract": null, "full_text": " Kernels for Multi-task Learning

 Charles A. Micchelli
 Department of Mathematics and Statistics
 State University of New York, The University at Albany
 1400 Washington Avenue, Albany, NY, 12222, USA

 Massimiliano Pontil
 Department of Computer Science
 University College London
 Gower Street, London WC1E 6BT, England, UK

 Abstract

This paper provides a foundation for multi-task learning using reproducing kernel Hilbert spaces of vector-valued functions. In this setting, the kernel is a matrix-valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix-valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi-task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation.

1 Introduction

This paper addresses the problem of learning a vector-valued function f : X → Y, where X is a set and Y a Hilbert space. We focus on linear spaces of such functions that admit a reproducing kernel, see [7]. This study is valuable from a variety of perspectives. Our main motivation is the practical problem of multi-task learning, where we wish to learn many related regression or classification functions simultaneously, see e.g. [3, 5, 6]. For instance, image understanding requires the estimation of multiple binary classifiers simultaneously, where each classifier is used to detect a specific object. 
Specific examples include locating a car within a pool of possibly similar objects, such as buses, motorbikes, faces, people, etc. Some of these objects or tasks may share common features, so it would be useful to relate their classifier parameters. Other examples include multimodal human computer interfaces, which require the modeling of both, say, speech and vision, or tumor prediction in bioinformatics from multiple microarray datasets.

Moreover, the spaces of vector-valued functions described in this paper may be useful for learning continuous transformations. In this case, X is a space of parameters and Y a Hilbert space of functions. For example, in face animation X represents pose and expression of a face and Y a space of functions IR^2 → IR, although in practice one considers discrete images, in which case f(x) is a finite dimensional vector whose components are associated to the image pixels. Other problems, such as image morphing, can be formulated as vector-valued learning.

When Y is an n-dimensional Euclidean space, one straightforward approach to learning a vector-valued function f = (f_1, ..., f_n) consists in separately representing each component of f by a linear space of smooth functions and then learning these components independently, for example by minimizing some regularized error functional. This approach does not capture relations between the components of f (which are associated to tasks or pixels in the examples above) and should not be the method of choice when these relations occur. In this paper we investigate how kernels can be used for representing vector-valued functions. We propose to do this by using a matrix-valued kernel K : X × X → IR^{n×n} that reflects the interaction amongst the components of f. This paper provides a foundation for this approach. 
For example, in the case of support vector machines (SVM's) [10], appropriate choices of the matrix-valued kernel implement a trade-off between a large margin for each per-task SVM and a large margin for combinations of these SVM's, e.g. their average.

The paper is organized as follows. In section 2 we formalize the above observations and show that reproducing kernel Hilbert spaces (RKHS) of vector-valued functions admit a kernel with values which are bounded linear operators on the output space Y, and we characterize the form of some of these operators in section 3. Finally, in section 4 we provide a novel proof for the representer theorem which is based on the notion of minimal norm interpolation and present linear multi-task learning algorithms.

2 RKHS of vector-valued functions

Let Y be a real Hilbert space with inner product (·, ·), X a set, and H a linear space of functions on X with values in Y. We assume that H is also a Hilbert space with inner product ⟨·, ·⟩. We present two methods to extend standard RKHS to vector-valued functions.

2.1 Matrix-valued kernels based on Aronszajn

The first approach extends the scalar case, Y = IR, in [2].

Definition 1 We say that H is a reproducing kernel Hilbert space (RKHS) of functions f : X → Y when, for any y ∈ Y and x ∈ X, the linear functional which maps f ∈ H to (y, f(x)) is continuous on H.

We conclude from the Riesz Lemma (see, e.g., [1]) that, for every x ∈ X and y ∈ Y, there is a linear operator K_x : Y → H such that

    (y, f(x)) = ⟨K_x y, f⟩.    (2.1)

For every x, t ∈ X we also introduce the linear operator K(x, t) : Y → Y defined, for every y ∈ Y, by

    K(x, t)y := (K_t y)(x).    (2.2)

In the proposition below we state the main properties of the function K. To this end, we let L(Y) be the set of all bounded linear operators from Y into itself and, for every A ∈ L(Y), we denote by A* its adjoint. We also use L_+(Y) to denote the cone of positive semidefinite bounded linear operators, i.e. 
A ∈ L_+(Y) provided that, for every y ∈ Y, (y, Ay) ≥ 0. When this inequality is strict for all y ≠ 0 we say A is positive definite. We also denote by IN_m the set of positive integers up to and including m. Finally, we say that H is normal provided there does not exist (x, y) ∈ X × (Y\{0}) such that the linear functional (y, f(x)) = 0 for all f ∈ H.

Proposition 1 If K(x, t) is defined, for every x, t ∈ X, by equation (2.2) and K_x is given by equation (2.1), then the kernel K satisfies, for every x, t ∈ X, the following properties:

 (a) For every y, z ∈ Y, we have that (y, K(x, t)z) = ⟨K_t z, K_x y⟩.

 (b) K(x, t) ∈ L(Y), K(x, t)* = K(t, x), and K(x, x) ∈ L_+(Y). Moreover, K(x, x) is positive definite for all x ∈ X if and only if H is normal.

 (c) For any m ∈ IN, {x_j : j ∈ IN_m} ⊆ X, {y_j : j ∈ IN_m} ⊆ Y we have that

    Σ_{j,ℓ ∈ IN_m} (y_j, K(x_j, x_ℓ)y_ℓ) ≥ 0.    (2.3)

PROOF. We prove (a) by merely choosing f = K_t z in equation (2.1) to obtain that

    ⟨K_x y, K_t z⟩ = (y, (K_t z)(x)) = (y, K(x, t)z).    (2.4)

Consequently, from this equation, we conclude that K(x, t) admits an algebraic adjoint K(t, x) defined everywhere on Y and, so, the uniform boundedness principle, see, e.g., [1, p. 48], implies that K(x, t) ∈ L(Y) and K(x, t)* = K(t, x). Moreover, choosing t = x in (a) proves that K(x, x) ∈ L_+(Y). As for the positive definiteness of K(x, x), merely use equation (2.1) and property (a). These remarks prove (b). As for (c), we again use property (a) to obtain that

    Σ_{j,ℓ ∈ IN_m} (y_j, K(x_j, x_ℓ)y_ℓ) = Σ_{j,ℓ ∈ IN_m} ⟨K_{x_j} y_j, K_{x_ℓ} y_ℓ⟩ = ‖ Σ_{j ∈ IN_m} K_{x_j} y_j ‖² ≥ 0.

This completes the proof.

For simplicity, we say that K : X × X → L(Y) is a matrix-valued kernel (or simply a kernel if no confusion will arise) if it satisfies properties (b) and (c). So far we have seen that if H is a RKHS of vector-valued functions, there exists a kernel. 
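Property (c) can be checked numerically by assembling the mn × mn block Gram matrix whose (j, ℓ) block is K(x_j, x_ℓ) and verifying that it is positive semidefinite. The sketch below is illustrative only; it assumes NumPy and uses, as a concrete example kernel, K(x, t) = exp(−‖x − t‖²) A with A a positive semidefinite matrix (a combination of the type constructed in section 3.2). The sample points and A are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 3, 2, 5                      # output dim, input dim, number of points
G = rng.standard_normal((n, n))
A = G @ G.T                            # a positive semidefinite n x n matrix
X = rng.standard_normal((m, d))        # sample points x_1, ..., x_m

def K(x, t):
    """Example matrix-valued kernel K(x, t) = exp(-||x - t||^2) A."""
    return np.exp(-np.sum((x - t) ** 2)) * A

# Block Gram matrix: the (j, l) block is K(x_j, x_l).
gram = np.block([[K(X[j], X[l]) for l in range(m)] for j in range(m)])

assert np.allclose(gram, gram.T)                 # property (b): K(x, t)* = K(t, x)
assert np.linalg.eigvalsh(gram).min() > -1e-8    # property (c): PSD up to round-off
```

The same check applies to any candidate matrix-valued kernel: only the function `K` needs to be replaced.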
In the spirit of the Moore–Aronszajn theorem for RKHS of scalar functions [2], it can be shown that if K : X × X → L(Y) is a kernel then there exists a unique (up to an isometry) RKHS of functions from X to Y which admits K as the reproducing kernel. The proof parallels the scalar case.

Given a vector-valued function f : X → Y we associate to it a scalar-valued function F : X × Y → IR defined by

    F(x, y) := (y, f(x)), x ∈ X, y ∈ Y.    (2.5)

We let H¹ be the linear space of all such functions. Thus, H¹ consists of functions which are linear in their second variable. We make H¹ into a Hilbert space by choosing ‖F‖ = ‖f‖. It then follows that H¹ is a RKHS with reproducing scalar-valued kernel defined, for all (x, y), (t, z) ∈ X × Y, by the formula

    K¹((x, y), (t, z)) := (y, K(x, t)z).    (2.6)

2.2 Feature map

The second approach uses the notion of feature map, see e.g. [9]. A feature map is a function Φ : X × Y → W, where W is a Hilbert space. A feature map representation of a kernel K has the property that, for every x, t ∈ X and y, z ∈ Y, there holds the equation

    (Φ(x, y), Φ(t, z)) = (y, K(x, t)z).

From equation (2.4) we conclude that every kernel admits a feature map representation (a Mercer type theorem) with W = H. With additional hypotheses on H and Y this representation can take the familiar form

    K_{ℓq}(x, t) = Σ_{r ∈ IN} φ_{ℓr}(x) φ_{qr}(t), ℓ, q ∈ IN.    (2.7)

Much more importantly, we begin with a feature map φ = (φ_ℓ : ℓ ∈ IN), where each φ_ℓ = (φ_{ℓr} : r ∈ IN) takes values in W, this being the space of square summable sequences on IN. We wish to learn a function f : X → Y, where Y = W and f = (f_ℓ : ℓ ∈ IN) with f_ℓ = (w, φ_ℓ) := Σ_{r ∈ IN} w_r φ_{ℓr} for each ℓ ∈ IN, where w ∈ W. We choose ‖f‖ = ‖w‖ and conclude that the space of all such functions is a Hilbert space of functions from X to Y with the kernel (2.7). These remarks connect feature maps to kernels and vice versa. 
Note that a kernel may have many feature maps which represent it, and a feature map representation for a kernel may not be the appropriate way to write it for numerical computations.

3 Kernel construction

In this section we characterize a wide variety of kernels which are potentially useful for applications.

3.1 Linear kernels

A first natural question concerning RKHS of vector-valued functions is: if X is IR^d, what is the form of linear kernels? In the scalar case a linear kernel is a quadratic form, namely K(x, t) = (x, Qt), where Q is a d × d positive semidefinite matrix. We claim that for Y = IR^n any linear matrix-valued kernel K = (K_{ℓq} : ℓ, q ∈ IN_n) has the form

    K_{ℓq}(x, t) = (B_ℓ x, B_q t), x, t ∈ IR^d    (3.8)

where the B_ℓ are p × d matrices for some p ∈ IN. To see that such a K is a kernel, simply note that K is in the Mercer form (2.7) for φ_ℓ(x) = B_ℓ x. On the other hand, since any linear kernel has a Mercer representation with linear features, we conclude that all linear kernels have the form (3.8). A special case is provided by choosing p = d and the B_ℓ to be diagonal matrices.

We note that the theory presented in section 2 can be naturally extended to the case where each component of the vector-valued function has a different input domain. This situation is important in multi-task learning, see e.g. [5]. To this end, we specify sets X_ℓ, ℓ ∈ IN_n, functions g_ℓ : X_ℓ → IR, and note that multi-task learning can be placed in the above framework by defining the input space

    X := X_1 × X_2 × ⋯ × X_n.

We are interested in vector-valued functions f : X → IR^n whose coordinates are given by f_ℓ(x) = g_ℓ(P_ℓ x), where x = (x_ℓ : x_ℓ ∈ X_ℓ, ℓ ∈ IN_n) and P_ℓ : X → X_ℓ is the projection operator defined, for every x ∈ X, by P_ℓ(x) = x_ℓ, ℓ ∈ IN_n. For ℓ, q ∈ IN_n, we suppose kernel functions C_{ℓq} : X_ℓ × X_q → IR are given such that the matrix-valued kernel whose elements are defined as

    K_{ℓq}(x, t) := C_{ℓq}(P_ℓ x, P_q t), ℓ, q ∈ IN_n

satisfies properties (b) and (c) of Proposition 1. 
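As a minimal numerical illustration of the claim that (3.8) defines a kernel, the following sketch builds the features φ_ℓ(x) = B_ℓ x and confirms that the resulting block Gram matrix is positive semidefinite. NumPy is assumed, and the matrices B_ℓ and the sample points are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p, m = 3, 4, 2, 6
B = [rng.standard_normal((p, d)) for _ in range(n)]   # B_1, ..., B_n: p x d
X = rng.standard_normal((m, d))                       # sample points

def K(x, t):
    """Linear matrix-valued kernel of equation (3.8): K_lq(x, t) = (B_l x, B_q t)."""
    feats = np.stack([Bl @ x for Bl in B])    # row l is the feature B_l x
    featt = np.stack([Bq @ t for Bq in B])
    return feats @ featt.T                    # n x n matrix

gram = np.block([[K(X[j], X[l]) for l in range(m)] for j in range(m)])
assert np.linalg.eigvalsh(gram).min() > -1e-8   # PSD: K is in the Mercer form (2.7)
```

Positive semidefiniteness is automatic here because the full Gram matrix is of the form ΦΦᵀ for the stacked feature matrix Φ, which is exactly the content of the Mercer-form argument in the text.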
An example of this construction is provided again by linear functions. Specifically, we choose X_ℓ = IR^{d_ℓ}, where d_ℓ ∈ IN, and C_{ℓq}(x_ℓ, t_q) = (Q_ℓ x_ℓ, Q_q t_q), x_ℓ ∈ X_ℓ, t_q ∈ X_q, where the Q_ℓ are p × d_ℓ matrices. In this case, the matrix-valued kernel K = (K_{ℓq} : ℓ, q ∈ IN_n) is given by

    K_{ℓq}(x, t) = (Q_ℓ P_ℓ x, Q_q P_q t)    (3.9)

which is of the form in equation (3.8) for B_ℓ = Q_ℓ P_ℓ, ℓ ∈ IN_n.

3.2 Combinations of kernels

The results in this section are based on a lemma by Schur which states that the elementwise product of two positive semidefinite matrices is also positive semidefinite, see [2, p. 358]. This result implies that, when Y is finite dimensional, the elementwise product of two matrix-valued kernels is also a matrix-valued kernel. Indeed, in view of the discussion at the end of section 2.2, we immediately conclude that the following two lemmas hold.

Lemma 1 If Y = IR^n and K¹ and K² are matrix-valued kernels then their elementwise product is a matrix-valued kernel.

This result allows us, for example, to enhance the linear kernel (3.8) to a polynomial kernel. In particular, if r is a positive integer, we define, for every ℓ, q ∈ IN_n,

    K_{ℓq}(x, t) := (B_ℓ x, B_q t)^r

and conclude that K = (K_{ℓq} : ℓ, q ∈ IN_n) is a kernel.

Lemma 2 If G : IR^d × IR^d → IR is a kernel and z_ℓ : X → IR^d, ℓ ∈ IN_n, are vector-valued functions, then the matrix-valued function K : X × X → IR^{n×n} whose elements are defined, for every x, t ∈ X, by

    K_{ℓq}(x, t) = G(z_ℓ(x), z_q(t))

is a matrix-valued kernel.

This lemma confirms, as a special case, that if z_ℓ(x) = B_ℓ x with B_ℓ a p × d matrix, ℓ ∈ IN_n, and G : IR^p × IR^p → IR is a scalar-valued kernel, then the function (3.8) is a matrix-valued kernel. 
When G is chosen to be a Gaussian kernel, we conclude that K_{ℓq}(x, t) = exp(−‖B_ℓ x − B_q t‖²) is a matrix-valued kernel.

In the scalar case it is well-known that a nonnegative combination of kernels is a kernel. The next proposition extends this result to matrix-valued kernels.

Proposition 2 If K_j, j ∈ IN_s, s ∈ IN, are scalar-valued kernels and A_j ∈ L_+(Y) then the function

    K = Σ_{j ∈ IN_s} A_j K_j    (3.10)

is a matrix-valued kernel.

PROOF. For any x, t ∈ X and c, d ∈ Y we have that

    (c, K(x, t)d) = Σ_{j ∈ IN_s} (c, A_j d) K_j(x, t)

and so the proposition follows from the Schur lemma.

Other results of this type can be found in [7]. The formula (3.10) can be used to generate a wide variety of matrix-valued kernels which have the flexibility needed for learning. For example, we obtain polynomial matrix-valued kernels by setting X = IR^d and K_j(x, t) = (x, t)^j, where x, t ∈ IR^d. We remark that, generally, the kernel in equation (3.10) cannot be reduced to a diagonal kernel. An interesting case of Proposition 2 is provided by low rank kernels, which may be useful in situations where the components of f are linearly related, that is, for every f ∈ H and x ∈ X, f(x) lies in a linear subspace M ⊆ Y. In this case, it is desirable to use a kernel which has the same property, namely that f(x) ∈ M, x ∈ X, for all f ∈ H. We can ensure this by an appropriate choice of the matrices A_j. For example, if M = span({b_j : j ∈ IN_s}) we may choose A_j = b_j b_j^T.

Matrix-valued Gaussian mixtures are obtained by choosing X = IR^d, Y = IR^n, {σ_j : j ∈ IN_s} ⊂ IR_+, and K_j(x, t) = exp(−σ_j ‖x − t‖²). Specifically,

    K(x, t) = Σ_{j ∈ IN_s} A_j e^{−σ_j ‖x − t‖²}

is a kernel on X × X for any {A_j : j ∈ IN_s} ⊂ L_+(IR^n).

4 Regularization and minimal norm interpolation

Let V : Y^m × IR_+ → IR be a prescribed function and consider the problem of minimizing the functional

    E(f) := V((f(x_j) : j ∈ IN_m), ‖f‖²)    (4.11)

over all functions f ∈ H. 
A special case is covered by functionals of the form

    E(f) := Σ_{j ∈ IN_m} Q(y_j, f(x_j)) + μ ‖f‖²    (4.12)

where μ is a positive parameter and Q : Y × Y → IR_+ is some prescribed loss function, e.g. the square loss. Within this general setting we provide a "representer theorem" for any function which minimizes the functional in equation (4.11). This result is well-known in the scalar case. Our proof technique uses the idea of minimal norm interpolation, a central notion in function estimation and interpolation.

Lemma 3 If y ∈ {(f(x_j) : j ∈ IN_m) : f ∈ H} ⊆ Y^m, the minimum of the problem

    min{‖f‖² : f(x_j) = y_j, j ∈ IN_m}    (4.13)

is unique and admits the form f̂ = Σ_{j ∈ IN_m} K_{x_j} c_j.

We refer to [7] for a proof. This approach achieves both simplicity and generality. For example, it can be extended to normed linear spaces, see [8]. Our next result establishes that any local minimizer¹ indeed has the same form as in Lemma 3. This result improves upon [9], where it is proven only for a global minimizer.

Theorem 1 If for every y ∈ Y^m the function h : IR_+ → IR defined for t ∈ IR_+ by h(t) := V(y, t) is strictly increasing and f_0 ∈ H is a local minimum of E, then f_0 = Σ_{j ∈ IN_m} K_{x_j} c_j for some {c_j : j ∈ IN_m} ⊆ Y.

PROOF. If g is any function in H such that g(x_j) = 0, j ∈ IN_m, and t a real number such that |t| ≤ ε/‖g‖, for ε > 0, then

    V(y_0, ‖f_0‖²) ≤ V(y_0, ‖f_0 + tg‖²)

where y_0 := (f_0(x_j) : j ∈ IN_m). Consequently, we have that ‖f_0‖² ≤ ‖f_0 + tg‖², from which it follows that ⟨f_0, g⟩ = 0. Thus, f_0 satisfies

    ‖f_0‖ = min{‖f‖ : f(x_j) = f_0(x_j), j ∈ IN_m, f ∈ H}

and the result follows from Lemma 3.

4.1 Linear regularization

We comment on regularization for linear multi-task learning and therefore consider minimizing the functional

    R_0(w) := Σ_{j ∈ IN_m} Σ_{ℓ ∈ IN_n} Q(y_{jℓ}, (w, B_ℓ x_j)) + μ ‖w‖²    (4.14)

for w ∈ IR^p. 
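To make the linear model (4.14) concrete, the sketch below minimizes it for the square loss Q(y, a) = (y − a)², in which case the minimizer solves the normal equations (ZᵀZ + μI)w = Zᵀy, where Z stacks one row z_{jℓ} = B_ℓ x_j per (example, task) pair. NumPy is assumed; the data and the matrices B_ℓ are random placeholders, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p, m, mu = 3, 4, 5, 20, 0.1
B = [rng.standard_normal((p, d)) for _ in range(n)]   # task matrices B_1, ..., B_n
X = rng.standard_normal((m, d))                       # shared inputs x_j
Y = rng.standard_normal((m, n))                       # labels y_{jl}

# One design row z_{jl} = B_l x_j per (example j, task l), j outer, l inner.
Z = np.array([Bl @ x for x in X for Bl in B])         # shape (m*n, p)
y = Y.reshape(-1)                                     # same (j, l) ordering

# Square-loss minimizer of R_0: (Z^T Z + mu I) w = Z^T y.
w = np.linalg.solve(Z.T @ Z + mu * np.eye(p), Z.T @ y)

# The per-task predictor on a new input x is f_l(x) = (w, B_l x).
f = lambda x: np.array([w @ (Bl @ x) for Bl in B])
```

For other losses, such as the hinge loss used later in this section, the same design matrix Z feeds a standard convex solver instead of a linear solve.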
We set u_ℓ = B_ℓ^T w, u = (u_ℓ : ℓ ∈ IN_n), and observe that the above functional is related to the functional

    R_1(u) := Σ_{j ∈ IN_m} Σ_{ℓ ∈ IN_n} Q(y_{jℓ}, (u_ℓ, x_j)) + μ J(u)    (4.15)

where we have defined the minimum norm functional

    J(u) := min{‖w‖² : w ∈ IR^p, B_ℓ^T w = u_ℓ, ℓ ∈ IN_n}.    (4.16)

¹A function f_0 ∈ H is a local minimum for E provided that there is a positive number ε such that whenever f ∈ H satisfies ‖f_0 − f‖ ≤ ε, then E(f_0) ≤ E(f).

Specifically, we have

    min{R_0(w) : w ∈ IR^p} = min{R_1((B_ℓ^T w : ℓ ∈ IN_n)) : w ∈ IR^p}.

The optimal solution ŵ of problem (4.16) is given by ŵ = Σ_{ℓ ∈ IN_n} B_ℓ c_ℓ, where the vectors {c_ℓ : ℓ ∈ IN_n} ⊂ IR^d satisfy the linear equations

    Σ_{k ∈ IN_n} B_ℓ^T B_k c_k = u_ℓ, ℓ ∈ IN_n,

and

    J(u) = Σ_{ℓ,q ∈ IN_n} (u_ℓ, (B̃^{-1})_{ℓq} u_q)

provided the block matrix B̃ = (B_ℓ^T B_q : ℓ, q ∈ IN_n), whose blocks are d × d, is nonsingular. We note that this analysis can be extended to the case of different inputs across the tasks by replacing x_j in equations (4.14) and (4.15) by x_{jℓ} ∈ IR^{d_ℓ} and the matrix B_ℓ by Q_ℓ P_ℓ; see section 3.1 for the definition of these quantities.

As a special example we choose B_ℓ to be the (n+1)d × d matrix whose d × d blocks are all zero except for the 1st and (ℓ+1)-th blocks, which are equal to c^{-1} I_d and I_d respectively, where c > 0 and I_d is the d-dimensional identity matrix. From equation (3.8) the matrix-valued kernel K reduces to

    K_{ℓq}(x, t) = (1/c² + δ_{ℓq})(x, t), ℓ, q ∈ IN_n, x, t ∈ IR^d.    (4.17)

Moreover, in this case the minimum in (4.16) is given by

    J(u) = c²/(n + c²) Σ_{ℓ ∈ IN_n} ‖u_ℓ‖² + n/(n + c²) Σ_{ℓ ∈ IN_n} ‖u_ℓ − (1/n) Σ_{q ∈ IN_n} u_q‖².    (4.18)

The model of minimizing (4.14) was proposed in [6] in the context of support vector machines (SVM's) for this special choice of matrices. The derivation presented here improves upon it. The regularizer (4.18) forces a trade-off between a desirably small size for the per-task parameters and closeness of each of these parameters to their average. This trade-off is controlled by the coupling parameter c. 
If c is small the task parameters are related (close to their average), whereas a large value of c means the tasks are learned independently. For SVM's, Q is the hinge loss function defined by Q(a, b) := max(0, 1 − ab), a, b ∈ IR. In this case the above regularizer trades off a large margin for each per-task SVM with closeness of each SVM to the average SVM. Numerical experiments showing the good performance of the multi-task SVM compared to both independent per-task SVM's (i.e., c = ∞ in equation (4.17)) and previous multi-task learning methods are also discussed in [6].

The analysis above can be used to derive other linear kernels. This can be done either by introducing the matrices B_ℓ as in the previous example, or by modifying the functional on the right hand side of equation (4.15). For example, we choose an n × n symmetric matrix A all of whose entries are in the unit interval, and consider the regularizer

    J(u) := (1/2) Σ_{ℓ,q ∈ IN_n} ‖u_ℓ − u_q‖² A_{ℓq} = Σ_{ℓ,q ∈ IN_n} (u_ℓ, u_q) L_{ℓq}    (4.19)

where L = D − A with D_{ℓq} = δ_{ℓq} Σ_{h ∈ IN_n} A_{ℓh}. The matrix A could be the weight matrix of a graph with n vertices and L the graph Laplacian, see e.g. [4]. The equation A_{ℓq} = 0 means that tasks ℓ and q are not related, whereas A_{ℓq} = 1 means a strong relation. In order to derive the matrix-valued kernel we note that (4.19) can be written as (u, L̃u), where L̃ is the n × n block matrix whose ℓ, q block is the d × d matrix L_{ℓq} I_d. Thus, we define w = L̃^{1/2} u so that we have u_ℓ = P_ℓ L̃^{-1/2} w (here L̃^{-1} is the pseudoinverse), where P_ℓ is the projection matrix from IR^{dn} to IR^d. Consequently, the feature map in equation (2.7) is given by B_ℓ = L̃^{-1/2} P_ℓ^T and we conclude that

    K_{ℓq}(x, t) = (x, P_ℓ L̃^{-1} P_q^T t).

Finally, as discussed in section 3.2, one can form polynomials or non-linear functions of the above linear kernels. 
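The graph Laplacian kernel derived above can be sketched numerically. Since the ℓ, q block of L̃ is L_{ℓq} I_d, the block P_ℓ L̃⁻¹ P_q^T equals (L⁺)_{ℓq} I_d, so the kernel simplifies to K_{ℓq}(x, t) = (L⁺)_{ℓq} (x, t), with L⁺ the pseudoinverse of L. The sketch below assumes NumPy and uses a random weight matrix A with entries in the unit interval; it verifies that the construction yields a matrix-valued kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 4, 3, 5
A = rng.uniform(0, 1, (n, n))
A = (A + A.T) / 2                        # symmetric task-relatedness weights in [0, 1]
np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L = D - A
Lpinv = np.linalg.pinv(L)                # pseudoinverse, as in the text

X = rng.standard_normal((m, d))          # random placeholder inputs

def K(x, t):
    """K_lq(x, t) = (x, P_l Ltilde^{-1} P_q^T t) = (L^+)_{lq} (x, t)."""
    return Lpinv * (x @ t)               # n x n matrix

gram = np.block([[K(X[j], X[l]) for l in range(m)] for j in range(m)])
assert np.linalg.eigvalsh(gram).min() > -1e-8    # PSD, as required of a kernel
```

Here the block Gram matrix is the Kronecker product of the (PSD) inner-product Gram of the inputs with the (PSD) matrix L⁺, which is why positive semidefiniteness holds.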
From Theorem 1 the minimizer of (4.12) is still a linear combination of the kernel evaluated at the given data examples.

5 Conclusions and future directions

We have described reproducing kernel Hilbert spaces of vector-valued functions and discussed their use in multi-task learning. We have provided a wide class of matrix-valued kernels which should prove useful in applications. In the future it would be valuable to study learning methods, using convex optimization or Monte Carlo integration, for choosing the matrix-valued kernel. This problem seems more challenging than its scalar counterpart due to the possibly large dimension of the output space. Another important problem is to study error bounds for learning in these spaces. Such analysis can clarify the role played by the spectrum of the matrix-valued kernel. Finally, it would be interesting to link the choice of matrix-valued kernels to the notion of relatedness between tasks discussed in [5].

Acknowledgments

This work was partially supported by EPSRC Grant GR/T18707/01 and NSF Grant No. ITR-0312113. We are grateful to Zhongying Chen, Head of the Department of Scientific Computation at Zhongshan University, for providing both of us with the opportunity to complete this work in a scientifically stimulating and friendly environment. We also wish to thank Andrea Caponnetto, Sayan Mukherjee and Tomaso Poggio for useful discussions.

References

 [1] N.I. Akhiezer and I.M. Glazman. Theory of Linear Operators in Hilbert Spaces, volume I. Dover reprint, 1993.

 [2] N. Aronszajn. Theory of reproducing kernels. Trans. AMS, 68:337-404, 1950.

 [3] J. Baxter. A Model for Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

 [4] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373-1396, 2003.

 [5] S. Ben-David and R. Schuller. 
Exploiting Task Relatedness for Multiple Task Learning. Proc. of the 16th Annual Conference on Learning Theory (COLT'03), 2003.

 [6] T. Evgeniou and M. Pontil. Regularized Multi-task Learning. Proc. of the SIGKDD Conference on Knowledge Discovery and Data Mining, 2004.

 [7] C.A. Micchelli and M. Pontil. On Learning Vector-Valued Functions. Neural Computation, 2004 (to appear).

 [8] C.A. Micchelli and M. Pontil. A Function Representation for Learning in Banach Spaces. Proc. of the 17th Annual Conference on Learning Theory (COLT'04), 2004.

 [9] B. Schölkopf, R. Herbrich, and A.J. Smola. A Generalized Representer Theorem. Proc. of the 14th Annual Conference on Computational Learning Theory (COLT'01), 2001.

[10] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
", "award": [], "sourceid": 2615, "authors": [{"given_name": "Charles", "family_name": "Micchelli", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}]}