{"title": "Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": null, "full_text": " Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

Koji Tsuda, Gunnar Rätsch and Manfred K. Warmuth
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
AIST CBRC, 2-43 Aomi, Koto-ku, Tokyo, 135-0064, Japan
Fraunhofer FIRST, Kekulestr. 7, 12489 Berlin, Germany
University of California at Santa Cruz
{koji.tsuda,gunnar.raetsch}@tuebingen.mpg.de, manfred@cse.ucsc.edu

Abstract

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated by the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case.
We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

1 Introduction

Most learning algorithms have been developed to learn a vector of parameters from data. However, an increasing number of papers are now dealing with more structured parameters. More specifically, when learning a similarity or a distance function among objects, the parameters are defined as a symmetric positive definite matrix that serves as a kernel (e.g. [14, 11, 13]). Learning is typically formulated as a parameter updating procedure to optimize a loss function. The gradient descent update [6] is one of the most commonly used algorithms, but it is not appropriate when the parameters form a positive definite matrix, because the updated parameter is not necessarily positive definite. Xing et al. [14] solved this problem by always correcting the updated matrix to be positive definite. However, no bound has been proven for this update-and-correction approach. In this paper, we introduce the Matrix Exponentiated Gradient update, which works as follows: First, the matrix logarithm of the current parameter matrix is computed. Then a step is taken in the direction of steepest descent. Finally, the parameter matrix is updated to the exponential of the modified log-matrix. Our update preserves symmetry and positive definiteness because the matrix exponential maps any symmetric matrix to a symmetric positive definite matrix.

Bregman divergences play a central role in the motivation and the analysis of on-line learning algorithms [5]. A learning problem is essentially defined by a loss function and a divergence that measures the discrepancy between parameters. More precisely, the updates are motivated by minimizing the sum of the loss function and the Bregman divergence, where the loss function is multiplied by a positive learning rate.
Different divergences lead to radically different updates [6]. For example, gradient descent is derived from the squared Euclidean distance, and the exponentiated gradient from the Kullback-Leibler divergence. We use the von Neumann divergence (also called the quantum relative entropy) for measuring the discrepancy between two positive definite matrices [8]. We derive a new Matrix Exponentiated Gradient update from this divergence (which is a Bregman divergence for positive definite matrices). Finally, we prove relative loss bounds using the von Neumann divergence as a measure of progress.

The following related key problem has also received a lot of attention recently [14, 11, 13]: find a symmetric positive definite matrix that satisfies a number of symmetric linear inequality constraints. The new DefiniteBoost algorithm greedily chooses the most violated constraint and performs an approximate Bregman projection. In the diagonal case, we recover AdaBoost [9]. We also show how the convergence proof of AdaBoost generalizes to the non-diagonal case.

2 von Neumann Divergence or Quantum Relative Entropy

If F is a real convex differentiable function on the parameter domain (symmetric d × d positive definite matrices) and f(W) := ∇F(W), then the Bregman divergence between two parameters W̃ and W is defined as

  Δ_F(W̃, W) = F(W̃) − F(W) − tr[(W̃ − W) f(W)].

When choosing F(W) = tr(W log W − W), then f(W) = log W and the corresponding Bregman divergence becomes the von Neumann divergence [8]:

  Δ_F(W̃, W) = tr(W̃ log W̃ − W̃ log W − W̃ + W). (1)

In this paper, we are primarily interested in the normalized case (when tr(W) = 1).
In this case, the symmetric positive definite matrices are related to the density matrices commonly used in statistical physics, and the divergence simplifies to Δ_F(W̃, W) = tr(W̃ log W̃ − W̃ log W).

If W = Σ_i λ_i v_i v_iᵀ is our notation for the eigenvalue decomposition, then we can rewrite the normalized divergence as

  Δ_F(W̃, W) = Σ_i λ̃_i ln λ̃_i − Σ_{i,j} λ̃_i ln λ_j (ṽ_iᵀ v_j)².

So this divergence quantifies the difference in the eigenvalues as well as the eigenvectors.

3 On-line Learning

In this section, we present a natural extension of the Exponentiated Gradient (EG) update [6] to an update for symmetric positive definite matrices.

At the t-th trial, the algorithm receives a symmetric instance matrix X_t ∈ R^{d×d}. It then produces a prediction ŷ_t = tr(W_t X_t) based on the algorithm's current symmetric positive definite parameter matrix W_t. Finally it incurs, for instance,¹ a quadratic loss (ŷ_t − y_t)², and updates its parameter matrix W_t.

  ¹For the sake of simplicity, we use the simple quadratic loss: L_t(W) = (tr(W X_t) − y_t)². For the general update, the gradient ∇L_t(W_t) is exponentiated in the update (4), and this gradient must be symmetric. Following [5], more general loss functions (based on Bregman divergences) are amenable to our techniques.

In the update we aim to solve the following problem:

  W_{t+1} = argmin_W { Δ_F(W, W_t) + η (tr(W X_t) − y_t)² }, (2)

where the convex function F defines the Bregman divergence and η is a positive learning rate. Setting the derivative with respect to W to zero, we have

  f(W_{t+1}) − f(W_t) + 2η (tr(W_{t+1} X_t) − y_t) X_t = 0. (3)

The update rule is derived by solving (3) with respect to W_{t+1}, but it is not solvable in closed form. A common way to avoid this problem is to approximate tr(W_{t+1} X_t) by tr(W_t X_t) [5]. Then we have the following update:

  W_{t+1} = f⁻¹( f(W_t) − 2η (ŷ_t − y_t) X_t ).

In our case, F(W) = tr(W log W − W), and thus f(W) = log W and f⁻¹(W) = exp W.
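As an aside (not part of the paper), the maps f(W) = log W and f⁻¹(W) = exp W are straightforward to evaluate from an eigendecomposition; the following minimal numpy sketch (function names are ours) also illustrates the key closure property used throughout: the matrix exponential sends any symmetric matrix to a symmetric positive definite one.

```python
import numpy as np

def sym_logm(W):
    """Matrix logarithm of a symmetric positive definite matrix."""
    lam, V = np.linalg.eigh(W)
    return V @ np.diag(np.log(lam)) @ V.T

def sym_expm(S):
    """Matrix exponential of a symmetric matrix; the result is
    always symmetric positive definite."""
    lam, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(lam)) @ V.T

# exp maps an arbitrary symmetric matrix into the positive definite
# cone, and log inverts it there:
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = (M + M.T) / 2                      # arbitrary symmetric matrix
W = sym_expm(S)
assert np.all(np.linalg.eigvalsh(W) > 0)
assert np.allclose(sym_logm(W), S)
```

Working in the eigenbasis like this is also how the numerically stable variant in Section 3.2 is naturally implemented.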
We also augment (2) with the constraint tr(W) = 1, leading to the following Matrix Exponentiated Gradient (MEG) update:

  W_{t+1} = (1/Z_t) exp( log W_t − 2η (ŷ_t − y_t) X_t ), (4)

where the normalization factor Z_t is tr[exp(log W_t − 2η (ŷ_t − y_t) X_t)]. Note that in the above update, the exponent log W_t − 2η (ŷ_t − y_t) X_t is an arbitrary symmetric matrix and the matrix exponential converts it back into a symmetric positive definite matrix. A numerically stable version of the MEG update is given in Section 3.2.

3.1 Relative Loss Bounds

We now begin with the definitions needed for the relative loss bounds. Let S = (X_1, y_1), ..., (X_T, y_T) denote a sequence of examples, where the instance matrices X_t ∈ R^{d×d} are symmetric and the labels y_t ∈ R. For any symmetric positive semi-definite matrix U with tr(U) = 1, define its total loss as L_U(S) = Σ_{t=1}^T (tr(U X_t) − y_t)². The total loss of the on-line algorithm is L_MEG(S) = Σ_{t=1}^T (tr(W_t X_t) − y_t)². We prove a bound on the relative loss L_MEG(S) − L_U(S) that holds for any U. The proof generalizes a similar bound for the Exponentiated Gradient update (Lemmas 5.8 and 5.9 of [6]). The relative loss bound is derived in two steps: Lemma 3.1 bounds the relative loss for an individual trial and Lemma 3.2 for a whole sequence (proofs are given in the full paper).

Lemma 3.1 Let W_t be any symmetric positive definite matrix. Let X_t be any symmetric matrix whose smallest and largest eigenvalues satisfy λ_max − λ_min ≤ r. Assume W_{t+1} is produced from W_t by the MEG update and let U be any symmetric positive semi-definite matrix. Then for any constants a and b such that 0 < a ≤ 2b/(2 + r²b) and any learning rate η = 2b/(2 + r²b), we have

  a (y_t − tr(W_t X_t))² − b (y_t − tr(U X_t))² ≤ Δ(U, W_t) − Δ(U, W_{t+1}). (5)

In the proof, we use the Golden-Thompson inequality [3], i.e., tr[exp(A + B)] ≤ tr[exp(A) exp(B)] for symmetric matrices A and B.
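The Golden-Thompson inequality is easy to spot-check numerically on random symmetric matrices; the following sketch is purely illustrative and plays no role in the proof:

```python
import numpy as np

def sym_expm(S):
    # matrix exponential via eigendecomposition (S symmetric)
    lam, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(lam)) @ V.T

rng = np.random.default_rng(1)
for _ in range(100):
    M = rng.standard_normal((5, 5)); A = (M + M.T) / 2
    M = rng.standard_normal((5, 5)); B = (M + M.T) / 2
    lhs = np.trace(sym_expm(A + B))
    rhs = np.trace(sym_expm(A) @ sym_expm(B))
    assert lhs <= rhs * (1 + 1e-12)   # tr e^{A+B} <= tr(e^A e^B)
```

Equality holds exactly when A and B commute; for generic random draws the inequality is strict.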
We also needed to prove the following generalization of Jensen's inequality to matrices: exp(ρ₁ A + ρ₂ (I − A)) ⪯ exp(ρ₁) A + exp(ρ₂)(I − A) for finite ρ₁, ρ₂ ∈ R and any symmetric matrix A with 0 ⪯ A ⪯ I. These two key inequalities will also be essential for the analysis of DefiniteBoost in the next section.

Lemma 3.2 Let W_1 and U be arbitrary symmetric positive definite initial and comparison matrices, respectively. Then for any c such that η = 2c/(r²(2 + c)),

  L_MEG(S) ≤ (1 + c/2) L_U(S) + (1/2 + 1/c) r² Δ(U, W_1). (6)

Proof For maximum tightness of (5), a should be chosen as large as possible, i.e. a = η = 2b/(2 + r²b). Let b = c/r², and thus a = η = 2c/(r²(2 + c)). Then (5) is rewritten as

  (2c/(2 + c)) (y_t − tr(W_t X_t))² − c (y_t − tr(U X_t))² ≤ r² (Δ(U, W_t) − Δ(U, W_{t+1})).

Adding the bounds for t = 1, ..., T, we get

  (2c/(2 + c)) L_MEG(S) − c L_U(S) ≤ r² (Δ(U, W_1) − Δ(U, W_{T+1})) ≤ r² Δ(U, W_1),

which is equivalent to (6).

Assuming L_U(S) ≤ ℓ_max and Δ(U, W_1) ≤ d_max, the bound (6) is tightest when c = r √(2 d_max/ℓ_max). Then we have L_MEG(S) − L_U(S) ≤ r √(2 ℓ_max d_max) + (r²/2) Δ(U, W_1).

3.2 Numerically Stable MEG Update

The MEG update is numerically unstable when the eigenvalues of W_t are around zero. However, we can "unwrap" W_{t+1} as follows:

  W_{t+1} = (1/Z̃_t) exp( c_t I + log W_1 − 2η Σ_{s=1}^t (ŷ_s − y_s) X_s ), (7)

where the constant Z̃_t normalizes the trace of W_{t+1} to one. As long as the eigenvalues of W_1 are not too small, the computation of log W_1 is stable. Note that the update is independent of the choice of c_t ∈ R. We incrementally maintain an eigenvalue decomposition of the matrix in the exponent (O(d³) per iteration):

  V_t Λ_t V_tᵀ = c_t I + log W_1 − 2η Σ_{s=1}^t (ŷ_s − y_s) X_s,

where the constant c_t is chosen so that the maximum eigenvalue of the right-hand side is zero. Now W_{t+1} = V_t exp(Λ_t) V_tᵀ / tr(exp(Λ_t)).

4 Bregman Projection and DefiniteBoost

In this section, we address the following Bregman projection problem:²

  W* = argmin_W Δ_F(W, W_1), subject to tr(W) = 1 and tr(W C_j) ≤ 0 for j = 1, ..., n, (8)

where the symmetric positive definite matrix W_1 of trace one is the initial parameter matrix, and C_1, ..., C_n are arbitrary symmetric matrices. Prior knowledge about W is encoded in the constraints, and the matrix closest to W_1 is chosen among the matrices satisfying all constraints. Tsuda and Noble [13] employed this approach for learning a kernel matrix among graph nodes, and the method can potentially be applied to learn a kernel matrix in other settings (e.g. [14, 11]).

  ²Note that if η is large then the on-line update (2) becomes a Bregman projection subject to a single equality constraint tr(W X_t) = y_t.

Problem (8) is a projection of W_1 onto the intersection of the convex regions defined by the constraints. It is well known that the Bregman projection onto the intersection of convex regions can be solved by sequential projections onto each region [1]. In the original papers only asymptotic convergence was shown. More recently, a connection [4, 7] was made to the AdaBoost algorithm, which has an improved convergence analysis [2, 9]. We generalize the latter algorithm and its analysis to symmetric positive definite matrices and call the new algorithm DefiniteBoost. As in the original setting, only approximate projections (Figure 1) are required to show fast convergence.

Figure 1: In (exact) Bregman projections, the intersection of convex sets (i.e., two lines here) is found by iterating projections onto each set. We project only approximately, so the projected point does not satisfy the current constraint. Nevertheless, global convergence to the optimal solution is guaranteed via our proofs.

Before presenting the algorithm, let us derive the dual problem of (8) by means of Lagrange multipliers γ:

  γ* = argmin_γ log tr[ exp( log W_1 − Σ_{j=1}^n γ_j C_j ) ], subject to γ_j ≥ 0. (9)

See [13] for a detailed derivation of the dual problem.
When (8) is feasible, the optimal solution is described as W* = (1/Z(γ*)) exp( log W_1 − Σ_{j=1}^n γ*_j C_j ), where Z(γ*) = tr[ exp( log W_1 − Σ_{j=1}^n γ*_j C_j ) ].

4.1 Exact Bregman Projections

First, let us present the exact Bregman projection algorithm to solve (8). We start from the initial parameter W_1. At the t-th step, the most unsatisfied constraint is chosen: j_t = argmax_{j=1,...,n} tr(W_t C_j). Let us use C_t as a short notation for C_{j_t}. Then the following Bregman projection with respect to the chosen constraint is solved:

  W_{t+1} = argmin_W Δ(W, W_t), subject to tr(W) = 1 and tr(W C_t) ≤ 0. (10)

By means of a Lagrange multiplier α, the dual problem is described as

  α_t = argmin_α tr[ exp( log W_t − α C_t ) ], subject to α ≥ 0. (11)

Using the solution of the dual problem, W_t is updated as

  W_{t+1} = (1/Z_t(α_t)) exp( log W_t − α_t C_t ), (12)

where the normalization factor is Z_t(α_t) = tr[exp(log W_t − α_t C_t)]. Note that we can use the same numerically stable update as in the previous section.

4.2 Approximate Bregman Projections

The solution of (11) cannot be obtained in closed form. However, one can use the following approximate solution:

  α_t = (1/(λ_t^max − λ_t^min)) log( (1 + r_t/λ_t^max) / (1 + r_t/λ_t^min) ), (13)

when the eigenvalues of C_t lie in the interval [λ_t^min, λ_t^max] and r_t = tr(W_t C_t). Since the most unsatisfied constraint is chosen, r_t ≥ 0 and thus α_t ≥ 0. Although the projection is done only approximately,³ the convergence of the dual objective (9) can be shown using the following upper bound.

  ³The approximate Bregman projection (with α_t as in (13)) can also be motivated as an on-line algorithm based on an entropic loss and learning rate one (following Section 3 and [4]).

Theorem 4.1 The dual objective (9) is bounded as

  tr[ exp( log W_1 − Σ_{j=1}^n γ_j C_j ) ] ≤ Π_{t=1}^T ρ(r_t), (14)

where

  ρ(r_t) = (1 − r_t/λ_t^max)^{λ_t^max/(λ_t^max − λ_t^min)} (1 − r_t/λ_t^min)^{−λ_t^min/(λ_t^max − λ_t^min)}.

The dual objective is monotonically decreasing, because ρ(r_t) ≤ 1.
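To make the step concrete, here is a small numpy sketch of one approximate projection: constraint selection, the step size (13), and the multiplicative update (12). The function names and the toy diagonal example are ours, not the authors'; in the diagonal case with eigenvalues ±1 the approximate step is exact, so the chosen constraint becomes tight after one update.

```python
import numpy as np

def sym_funm(S, f):
    """Apply a scalar function to a symmetric matrix via eigendecomposition."""
    lam, V = np.linalg.eigh(S)
    return V @ np.diag(f(lam)) @ V.T

def definiteboost_step(W, C_list):
    """One approximate Bregman projection step, eqs. (12) and (13)."""
    viol = [float(np.trace(W @ C)) for C in C_list]
    t = int(np.argmax(viol))               # most violated constraint
    C, r = C_list[t], viol[t]
    lam = np.linalg.eigvalsh(C)            # ascending order
    lmin, lmax = lam[0], lam[-1]
    alpha = np.log((1 + r / lmax) / (1 + r / lmin)) / (lmax - lmin)
    W_new = sym_funm(sym_funm(W, np.log) - alpha * C, np.exp)
    return W_new / np.trace(W_new)

# toy diagonal example; this is exactly the AdaBoost case of Section 4.3,
# where lambda_max = 1 and lambda_min = -1
W = np.diag([0.6, 0.2, 0.2])
C_list = [np.diag([1.0, -1.0, -1.0]), np.diag([-1.0, 1.0, -1.0])]
W = definiteboost_step(W, C_list)
assert abs(np.trace(W) - 1.0) < 1e-9
assert abs(np.trace(W @ C_list[0])) < 1e-9   # chosen constraint now tight
```

Here the update rescales the diagonal of W to (0.5, 0.25, 0.25), which is precisely the AdaBoost reweighting with α_t = ½ log((1 + r_t)/(1 − r_t)).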
Also, since r_t corresponds to the maximum value among all constraint violations {r_j}_{j=1}^n, we have ρ(r_t) = 1 only if r_t = 0. Thus the dual objective continues to decrease until all constraints are satisfied.

4.3 Relation to Boosting

When all matrices are diagonal, DefiniteBoost degenerates to AdaBoost [9]: Let {x_i, y_i}_{i=1}^d be the training samples, where x_i ∈ R^m and y_i ∈ {−1, 1}. Let h_1(x), ..., h_n(x) ∈ [−1, 1] be the weak hypotheses. For the j-th hypothesis h_j(x), let us define C_j = diag(y_1 h_j(x_1), ..., y_d h_j(x_d)). Since |y h_j(x)| ≤ 1, we have λ_t^max/min = ±1 for any t. Setting W_1 = I/d, the dual objective (9) is rewritten as

  (1/d) Σ_{i=1}^d exp( −y_i Σ_{j=1}^n γ_j h_j(x_i) ),

which is equivalent to the exponential loss function used in AdaBoost. Since C_j and W_1 are diagonal, the matrix W_t stays diagonal after the update. If w_{ti} = [W_t]_{ii}, the updating formula (12) becomes the AdaBoost update: w_{t+1,i} = w_{ti} exp(−α_t y_i h_t(x_i)) / Z_t(α_t). The approximate solution (13) for α_t becomes α_t = (1/2) log((1 + r_t)/(1 − r_t)), where r_t = Σ_{i=1}^d w_{ti} y_i h_t(x_i) is the weighted margin (edge) of the t-th hypothesis.

5 Experiments on Learning Kernels

In this section, our technique is applied to learning a kernel matrix from a set of distance measurements. This application is not on-line per se, but it nevertheless shows that the theoretical bounds can be reasonably tight on natural data.

When K is a d × d kernel matrix among d objects, the entry K_ij characterizes the similarity between objects i and j. In the feature space, K_ij corresponds to the inner product between objects i and j, and thus the Euclidean distance can be computed from the entries of the kernel matrix [10]. In some cases, the kernel matrix is not given explicitly, but only a set of distance measurements is available.
The data are represented either as (i) quantitative distance values (e.g., the distance between i and j is 0.75), or (ii) qualitative evaluations (e.g., the distance between i and j is small) [14, 13]. Our task is to obtain a positive definite kernel matrix which fits the given distance data well.

On-line kernel learning. In the first experiment, we consider the on-line learning scenario in which only one distance example is shown to the learner at each time step. The distance example at time t is described as {a_t, b_t, y_t}, which indicates that the squared Euclidean distance between objects a_t and b_t is y_t. Let us define a time-developing sequence of kernel matrices as {W_t}_{t=1}^T, and the corresponding points in the feature space as {x_{ti}}_{i=1}^d (i.e. [W_t]_{ab} = x_{ta}ᵀ x_{tb}). Then the total loss incurred by this sequence is

  Σ_{t=1}^T ( ‖x_{t,a_t} − x_{t,b_t}‖² − y_t )² = Σ_{t=1}^T ( tr(W_t X_t) − y_t )²,

where X_t is a symmetric matrix whose (a_t, a_t) and (b_t, b_t) elements are 0.5, whose (a_t, b_t) and (b_t, a_t) elements are −0.5, and whose other elements are zero. We consider a controlled experiment in which the distance examples are created from a known target kernel matrix. We used a 52 × 52 kernel matrix among gyrB proteins of bacteria (d = 52). This data set contains three bacteria species (see [12] for details). Each distance example is created by randomly choosing one element of the target kernel. The initial parameter was set as W_1 = I/d.

Figure 2: Numerical results of on-line learning. (Left) total loss against the number of iterations. The dashed line shows the loss bound. (Right) classification error of the nearest neighbor classifier using the learned kernel. The dashed line shows the error by the target kernel.
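For illustration, one MEG step (4) on such a distance example can be sketched as follows. This is a hypothetical toy setup with d = 4 rather than the gyrB data, using the max-eigenvalue shift of Section 3.2 for stability; the function name is ours.

```python
import numpy as np

def meg_step(W, X, y, eta):
    """One Matrix Exponentiated Gradient update, eq. (4), with the
    max-eigenvalue shift of Section 3.2 for numerical stability."""
    lam, V = np.linalg.eigh(W)
    logW = V @ np.diag(np.log(lam)) @ V.T
    y_hat = float(np.trace(W @ X))
    M = logW - 2.0 * eta * (y_hat - y) * X
    mu, U = np.linalg.eigh(M)
    mu -= mu.max()                      # shift so the largest eigenvalue is 0
    W_new = U @ np.diag(np.exp(mu)) @ U.T
    return W_new / np.trace(W_new)

d = 4
W = np.eye(d) / d                       # W_1 = I/d, as in the experiment
a, b, y = 0, 1, 0.75                    # toy distance example {a_t, b_t, y_t}
X = np.zeros((d, d))
X[a, a] = X[b, b] = 0.5
X[a, b] = X[b, a] = -0.5
print(np.trace(W @ X))                  # prediction before the update: 0.25
W = meg_step(W, X, y, eta=2.0)
print(np.trace(W @ X))                  # prediction has moved toward y = 0.75
```

A single step with η = 2 already moves the prediction most of the way from 1/d toward the observed squared distance, while W stays symmetric positive definite with trace one.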
When the comparison matrix U is set to the target matrix, L_U(S) = 0 and ℓ_max = 0, because all the distance examples are derived from the target matrix. Therefore we chose the learning rate η = 2, which minimizes the relative loss bound of Lemma 3.2. The total loss of the kernel matrix sequence obtained by the matrix exponentiated update is shown in Figure 2 (left). In the plot, we also show the relative loss bound. The bound seems to give a reasonably tight performance guarantee: it is about twice the actual total loss. To evaluate the learned kernel matrix, the prediction accuracy for the bacteria species by the nearest neighbor classifier is calculated (Figure 2, right), where the 52 proteins are randomly divided into 50% training and 50% testing data. The value shown in the plot is the test error averaged over 10 different divisions. It took a large number of iterations (≈ 2 × 10⁵) for the error rate to converge to the level of the target kernel. In practice one can often increase the learning rate for faster convergence, but here we chose the small rate suggested by our analysis to check the tightness of the bound.

Kernel learning by Bregman projection. Next, let us consider a batch learning scenario where we have a set of qualitative distance evaluations (i.e. inequality constraints). Given n pairs of similar objects {a_j, b_j}_{j=1}^n, the inequality constraints are constructed as ‖x_{a_j} − x_{b_j}‖² ≤ ε, j = 1, ..., n, where ε is a predetermined constant. If X_j is defined as in the previous section and C_j = X_j − εI, the inequalities are then rewritten as tr(W C_j) ≤ 0, j = 1, ..., n. The largest and smallest eigenvalues of any C_j are 1 − ε and −ε, respectively. As in the previous section, distance examples are generated from the target kernel matrix between gyrB proteins.
Setting ε = 0.2/d, we collected all object pairs whose squared distance in the feature space is less than ε to yield 980 inequalities (n = 980). Figure 3 (left) shows the convergence of the dual objective function as proven in Theorem 4.1. The convergence was much faster than in the previous experiment because, in the batch setting, one can choose the most unsatisfied constraint and optimize the step size as well. Figure 3 (right) shows the classification error of the nearest neighbor classifier. As opposed to the previous experiment, the error rate is higher than that of the target kernel matrix, because a substantial amount of information is lost by the conversion to inequality constraints.

Figure 3: Numerical results of Bregman projection. (Left) convergence of the dual objective function. (Right) classification error of the nearest neighbor classifier using the learned kernel.

6 Conclusion

We motivated and analyzed a new update for symmetric positive definite matrices using the von Neumann divergence. We showed that the standard bounds for on-line learning and boosting generalize to the case when the parameters are a symmetric positive definite matrix of trace one instead of a probability vector. As in quantum physics, the eigenvalues act as probabilities.

Acknowledgments We would like to thank B. Schölkopf, M. Kawanabe, J. Liao and W.S. Noble for fruitful discussions. M.W. was supported by NSF grant CCR 9821087 and UC Discovery grant LSIT02-10110. K.T. and G.R. gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778). Part of this work was done while all three authors were visiting National ICT Australia in Canberra.

References

[1] L.M. Bregman.
Finding the common point of convex sets by the method of successive projections. Dokl. Akad. Nauk SSSR, 165:487–490, 1965.
[2] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[3] S. Golden. Lower bounds for the Helmholtz function. Phys. Rev., 137:B1127–B1128, 1965.
[4] J. Kivinen and M.K. Warmuth. Boosting as entropy projection. In Proc. 12th Annu. Conference on Comput. Learning Theory, pages 134–144. ACM Press, New York, NY, 1999.
[5] J. Kivinen and M.K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3):301–329, 2001.
[6] J. Kivinen and M.K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
[7] J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 125–133, New York, NY, 1999. ACM Press.
[8] M.A. Nielsen and I.L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
[9] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[10] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[11] I.W. Tsang and J.T. Kwok. Distance metric learning with kernels. In Proceedings of the International Conference on Artificial Neural Networks (ICANN'03), pages 126–129, 2003.
[12] K. Tsuda, S. Akaho, and K. Asai. The EM algorithm for kernel matrix completion with auxiliary data. Journal of Machine Learning Research, 4:67–81, May 2003.
[13] K. Tsuda and W.S. Noble. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 2004. To appear.
[14] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell.
Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, Cambridge, MA, 2003.
", "award": [], "sourceid": 2596, "authors": [{"given_name": "Koji", "family_name": "Tsuda", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}