{"title": "A Mathematical Programming Approach to the Kernel Fisher Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 591, "page_last": 597, "abstract": null, "full_text": "A Mathematical Programming Approach to the \n\nKernel Fisher Algorithm \n\nSebastian Mika*, Gunnar Ratsch*, and Klaus-Robert Miiller*+ \n\n*GMD FIRST.lDA, KekulestraBe 7, 12489 Berlin, Germany \n+University of Potsdam, Am Neuen Palais 10, 14469 Potsdam \n\n{mika, raetsch, klaus}@jirst.gmd.de \n\nAbstract \n\nWe investigate a new kernel-based classifier: the Kernel Fisher Discrim(cid:173)\ninant (KFD). A mathematical programming formulation based on the ob(cid:173)\nservation that KFD maximizes the average margin permits an interesting \nmodification of the original KFD algorithm yielding the sparse KFD. We \nfind that both, KFD and the proposed sparse KFD, can be understood \nin an unifying probabilistic context. Furthermore, we show connections \nto Support Vector Machines and Relevance Vector Machines. From this \nunderstanding, we are able to outline an interesting kernel-regression \ntechnique based upon the KFD algorithm. Simulations support the use(cid:173)\nfulness of our approach. \n\n1 Introduction \n\nRecent years have shown an enormous interest in kernel-based classification algorithms, \nprimarily in Support Vector Machines (SVM) [2]. The success of SVMs seems to be trig(cid:173)\ngered by (i) their good generalization performance, (ii) the existence of a unique solution, \nand (iii) the strong theoretical background: structural risk minimization [12], supporting \nthe good empirical results. One of the key ingredients responsible for this success is the \nuse of Mercer kernels, allowing for nonlinear decision surfaces which even might incorpo(cid:173)\nrate some prior knowledge about the problem to solve. For our purpose, a Mercer kernel \ncan be defined as a function k : IRn x IRn --+ IR, for which some (nonlinear) mapping \n~ : IRn --+ F into afeature ,space F exists, such that k(x, y) = (~(x) . ~(y)). Clearly, the \nuse of such kernel functions is not limited to SVMs. The interpretation as a dot-product \nin another space makes it particularly easy to develop new algorithms: take any (usually) \nlinear method and reformulate it using training samples only in dot-products, which are \nthen replaced by the kernel. Examples thereof, among others, are Kernel-PCA [9] and the \nKernel Fisher Discriminant (KFD [4]; see also [8, 1]). \nIn this article we consider algorithmic ideas for KFD. Interestingly KFD - although ex(cid:173)\nhibiting a similarly good performance as SVMs - has no explicit concept of a margin. This \nis noteworthy since the margin is often regarded as explanation for good generalization \nin SVMs. We will give an alternative formulation of KFD which makes the difference \nbetween both techniques explicit and allows a better understanding of the algorithms. An(cid:173)\nother advantage of the new formulation is that we can derive more efficient algorithms for \noptimizing KFDs, that have e.g. sparseness properties or can be used for regression. \n\n\f2 A Review of Kernel Fisher Discriminant \n\nThe idea of the KFD is to solve the problem of Fisher's linear discriminant in a kernel \nfeature space F , thereby yielding a nonlinear discriminant in the input space. First we \nfix some notation. Let {Xi Ii = 1, ... ,e} be our training sample and y E {-1, 1}l be \nthe vector of corresponding labels. 
In this article we consider algorithmic ideas for KFD. Interestingly, although KFD exhibits a similarly good performance as SVMs, it has no explicit concept of a margin. This is noteworthy since the margin is often regarded as the explanation for the good generalization of SVMs. We will give an alternative formulation of KFD which makes the difference between the two techniques explicit and allows a better understanding of the algorithms. Another advantage of the new formulation is that we can derive more efficient algorithms for optimizing KFDs, which, e.g., have sparseness properties or can be used for regression.

2 A Review of Kernel Fisher Discriminant

The idea of the KFD is to solve the problem of Fisher's linear discriminant in a kernel feature space $\mathcal{F}$, thereby yielding a nonlinear discriminant in the input space. First we fix some notation. Let $\{x_i \mid i = 1, \ldots, \ell\}$ be our training sample and $y \in \{-1, 1\}^\ell$ the vector of corresponding labels. Furthermore, define $\mathbf{1} \in \mathbb{R}^\ell$ as the vector of all ones, $\mathbf{1}_1, \mathbf{1}_2 \in \mathbb{R}^\ell$ as binary (0, 1) vectors corresponding to the class labels, and let $I$, $I_1$, and $I_2$ be appropriate index sets over $\ell$ and the two classes, respectively (with $\ell_i = |I_i|$).

In the linear case, Fisher's discriminant is computed by maximizing the ratio of between-class to within-class variance, $J(w) = (w^\top S_B w)/(w^\top S_W w)$, where $S_B = (m_2 - m_1)(m_2 - m_1)^\top$, $S_W = \sum_{k=1,2} \sum_{i \in I_k} (x_i - m_k)(x_i - m_k)^\top$, and $m_k$ denotes the sample mean for class $k$. To solve the problem in a kernel feature space $\mathcal{F}$ one needs a formulation which makes use of the training samples only in terms of dot-products. One first shows [4] that there exists an expansion for $w \in \mathcal{F}$ in terms of mapped training patterns, i.e.

$$w = \sum_{i=1}^{\ell} \alpha_i \Phi(x_i). \qquad (1)$$

Using some straightforward algebra, the optimization problem for the KFD can then be written as [5]

$$J(\alpha) = \frac{\alpha^\top M \alpha}{\alpha^\top N \alpha} = \frac{(\alpha^\top \mu)^2}{\alpha^\top N \alpha}, \qquad (2)$$

where $\mu_i = \frac{1}{\ell_i} K \mathbf{1}_i$, $N = K K^\top - \sum_{i=1,2} \ell_i \mu_i \mu_i^\top$, $\mu = \mu_2 - \mu_1$, $M = \mu \mu^\top$, and $K_{ij} = k(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))$.
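Since $M = \mu\mu^\top$ has rank one, the quotient (2) is maximized, up to an irrelevant scaling, by $\alpha \propto N^{-1}\mu$. The sketch below is a minimal illustration of this closed-form solution (it is not the mathematical programming formulation developed in this paper); the ridge term added to $N$ and the choice of threshold $b$ halfway between the projected class means are assumptions made only for the example.

```python
import numpy as np

def train_kfd(K, y, reg=1e-3):
    """Kernel Fisher Discriminant from a kernel matrix K and labels y in {-1, +1}.

    Returns the expansion coefficients alpha of eq. (1) and a bias b.
    """
    ell = len(y)
    idx1, idx2 = (y == -1), (y == +1)
    ell1, ell2 = idx1.sum(), idx2.sum()

    # class-wise kernelized means: mu_k = (1 / ell_k) * K * 1_k
    mu1 = K[:, idx1].sum(axis=1) / ell1
    mu2 = K[:, idx2].sum(axis=1) / ell2
    mu = mu2 - mu1

    # within-class scatter in alpha-space: N = K K^T - sum_k ell_k mu_k mu_k^T
    N = K @ K.T - ell1 * np.outer(mu1, mu1) - ell2 * np.outer(mu2, mu2)
    N += reg * np.eye(ell)                 # ridge regularization (assumption)

    alpha = np.linalg.solve(N, mu)         # maximizer of (2) up to scaling
    b = -0.5 * (alpha @ mu1 + alpha @ mu2) # boundary midway between projected means
    return alpha, b

def kfd_project(K_test, alpha, b):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x) + b.

    K_test[j, i] = k(x_test_j, x_train_i), e.g. rbf_kernel_matrix(X_test, X_train).
    """
    return K_test @ alpha + b
```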
The output of the discriminant for a pattern $x$ is $f(x) = (w \cdot \Phi(x)) + b = \sum_{i=1}^{\ell} \alpha_i k(x_i, x) + b$, where $w$ is given by (1). Assuming a Gaussian noise model with variance $\sigma^2$, the likelihood can be written as

$$p(y \mid \alpha, \sigma^2) \propto \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{\ell} \big( (w \cdot \Phi(x_i)) + b - y_i \big)^2 \right).$$
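Maximizing this likelihood over $\alpha$ and $b$ (for fixed $\sigma^2$) is equivalent to minimizing the sum of squared errors between the outputs $f(x_i)$ and the targets $y_i$. The sketch below makes this explicit; the small ridge term is an assumption added purely for numerical stability and is not part of the likelihood.

```python
import numpy as np

def neg_log_likelihood(alpha, b, K, y, sigma2):
    """Negative log of the Gaussian likelihood above, up to an additive constant.

    K is the training kernel matrix, so f_i = sum_j alpha_j k(x_i, x_j) + b.
    """
    f = K @ alpha + b
    return np.sum((f - y) ** 2) / (2.0 * sigma2)

def fit_by_least_squares(K, y, ridge=1e-6):
    """Maximize the likelihood over (alpha, b) by solving the equivalent
    least-squares problem; the ridge term is an assumption for stability
    and does not penalize the bias."""
    ell = len(y)
    A = np.hstack([K, np.ones((ell, 1))])   # columns: kernel expansion + bias
    reg = ridge * np.eye(ell + 1)
    reg[-1, -1] = 0.0                        # leave the bias unpenalized
    sol = np.linalg.solve(A.T @ A + reg, A.T @ y)
    return sol[:-1], sol[-1]                 # alpha, b
```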