{"title": "Support Vector Classification with Input Data Uncertainty", "book": "Advances in Neural Information Processing Systems", "page_first": 161, "page_last": 168, "abstract": null, "full_text": "Support Vector Classification with Input Data Uncertainty

Jinbo Bi
Computer-Aided Diagnosis & Therapy Group
Siemens Medical Solutions, Inc.
Malvern, PA 19355
jinbo.bi@siemens.com

Tong Zhang
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
tzhang@watson.ibm.com

Abstract

This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input.

1 Introduction

In the traditional formulation of supervised learning, we seek a predictor that maps input x to output y. The predictor is constructed from a set of training examples {(x_i, y_i)}. A hidden underlying assumption is that errors are confined to the output y. That is, the input data are not corrupted with noise; or even when noise is present in the data, its effect is ignored in the learning formulation.

However, for many applications, this assumption is unrealistic.
Sampling errors, modeling errors, and instrument errors may preclude the possibility of knowing the input data exactly. For example, in the problem of classifying sentences from speech recognition outputs for call-routing applications, the speech recognition system may make errors, so the observed text is corrupted with noise. In image classification applications, some features may rely on image processing outputs that introduce errors. Hence classification problems based on the observed text or image features have noisy inputs. Moreover, many systems can provide estimates of the reliability of their outputs, which measure how uncertain each element of the outputs is. This confidence information, typically ignored in traditional learning formulations, can be useful and should be considered in the learning formulation.

A plausible approach for dealing with noisy input is to use the standard learning formulation without modeling the underlying input uncertainty. If we assume that the same noise is observed both in the training data and in the test data, then the noise will cause similar effects in the training and testing phases. Based on this (non-rigorous) reasoning, one can argue that the issue of input noise may be ignored. However, we show in this paper that by modeling input uncertainty, we can obtain more accurate predictors.

2 Statistical models for prediction problems with uncertain input

Consider (x_i, y_i), where x_i is corrupted with noise. Let x'_i be the original uncorrupted input. We consider the following data generating process: first, (x'_i, y_i) is generated according to a distribution p(x'_i, y_i | θ), where θ is an unknown parameter that should be estimated from the data; next, given (x'_i, y_i), we assume that x_i is generated from x'_i (but independent of y_i) according to a distribution p(x_i | θ', σ_i, x'_i), where θ' is another possibly unknown parameter, and σ_i is a known parameter which is our estimate of the uncertainty (e.g. variance) for x_i. The joint probability of (x'_i, x_i, y_i) can be written as:

p(x'_i, x_i, y_i) = p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i).

The joint probability of (x_i, y_i) is obtained by integrating out the unobserved quantity x'_i:

p(x_i, y_i) = ∫ p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i) dx'_i.

This model can be considered a mixture model where each mixture component corresponds to a possible true input x'_i that is not observed. In this framework, the unknown parameter (θ, θ') can be estimated from the data using the maximum-likelihood estimate:

max_{θ,θ'} Σ_i ln p(x_i, y_i | θ, θ') = max_{θ,θ'} Σ_i ln ∫ p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i) dx'_i.   (1)

Although this is a principled approach under our data generation process, the integration over the unknown true input x'_i often leads to a very complicated formulation which is difficult to solve. Moreover, it is not straightforward to extend the method to non-probabilistic formulations such as support vector machines. Therefore we shall consider an alternative that is computationally more tractable and easier to generalize. The method we employ in this paper can be regarded as an approximation to (1), often used in engineering applications as a heuristic for mixture estimation.
In this method, we simply regard each x'_i as a parameter of the probability model, so the maximum-likelihood estimate becomes:

max_{θ,θ'} Σ_i ln sup_{x'_i} [p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i)].   (2)

If our probability model is correctly specified, then (1) is the preferred formulation. However, in practice we may not know the exact p(x_i | θ', σ_i, x'_i) (for example, we may not be able to estimate the level of uncertainty σ_i accurately). Therefore, under mis-specified probability models, (1) is not necessarily always the better method.

Intuitively, (1) and (2) have similar effects, since large values of p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i) dominate the summation in ∫ p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i) dx'_i. That is, both methods prefer a parameter configuration such that the product p(x'_i, y_i | θ) p(x_i | θ', σ_i, x'_i) is large for some x'_i. If an observation x_i is contaminated with large noise, so that p(x_i | θ', σ_i, x'_i) has a flat shape, then we can pick an x'_i that is very different from x_i and predicts y_i well. On the other hand, if an observation x_i is contaminated with very small noise, then (1) and (2) penalize a parameter θ such that p(x_i, y_i | θ) is small. This has the effect of ignoring data that are very uncertain and relying on data that are less contaminated.

In the literature, there are two types of statistical models: generative models and discriminative models (conditional models).
We focus on discriminative modeling in this paper since it usually leads to better prediction performance. In discriminative modeling, we assume that p(x'_i, y_i | θ) has the form p(x'_i, y_i | θ) = p(x'_i) p(y_i | θ, x'_i).

As an example, we consider regression problems with Gaussian noise:

p(x'_i, y_i | θ) ∝ p(x'_i) exp(−(θ^T x'_i − y_i)^2 / (2σ^2)),    p(x_i | θ', σ_i, x'_i) ∝ exp(−||x_i − x'_i||^2 / (2σ_i^2)).

The method in (2) becomes

θ = arg min_θ Σ_i inf_{x'_i} [ (θ^T x'_i − y_i)^2 / (2σ^2) + ||x_i − x'_i||^2 / (2σ_i^2) ].   (3)

This formulation is closely related (but not identical) to the so-called total least squares (TLS) method [6, 5]. The motivation for total least squares is the same as what we consider in this paper: input data are contaminated with noise. Unlike the statistical modeling approach we adopted in this section, the total least squares algorithm is derived from a numerical computation point of view. The resulting formulation is similar to (3), but its solution can be conveniently described by a matrix SVD decomposition. The method has been widely applied in engineering applications, and is known to give better performance than the standard least squares method for problems with uncertain inputs.
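For fixed θ, the inner infimum in (3) is a strictly convex quadratic in x'_i and therefore has a closed-form minimizer, obtained by setting the gradient to zero. The sketch below is our own illustration (not code from the paper; all function names are ours):

```python
import numpy as np

def inner_obj(theta, x, y, xp, sigma, sigma_i):
    """Per-example objective of (3): (theta^T x' - y)^2/(2 sigma^2) + ||x - x'||^2/(2 sigma_i^2)."""
    return (theta @ xp - y) ** 2 / (2 * sigma**2) + np.sum((x - xp) ** 2) / (2 * sigma_i**2)

def best_xprime(theta, x, y, sigma, sigma_i):
    """Closed-form minimizer of the inner inf over x' in (3) (gradient set to zero)."""
    r = theta @ x - y  # residual at the observed input
    return x - sigma_i**2 * r * theta / (sigma**2 + sigma_i**2 * (theta @ theta))
```

Substituting the minimizer back shows that each term of (3) profiles to (θ^T x_i − y_i)^2 / (2(σ^2 + σ_i^2 ||θ||^2)), so examples with large uncertainty σ_i are automatically down-weighted, consistent with the discussion above.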
In our framework, we can regard (3) as the underlying statistical model for total least squares.

For binary classification where y_i ∈ {±1}, we consider a logistic conditional probability model for y_i, while still assuming Gaussian noise in the input:

p(x'_i, y_i | θ) ∝ p(x'_i) / (1 + exp(−θ^T x'_i y_i)),    p(x_i | θ', σ_i, x'_i) ∝ exp(−||x_i − x'_i||^2 / (2σ_i^2)).

Similar to the total least squares method (3), we obtain the following formulation from (2):

θ = arg min_θ Σ_i inf_{x'_i} [ ln(1 + e^{−θ^T x'_i y_i}) + ||x_i − x'_i||^2 / (2σ_i^2) ].   (4)

A well-known disadvantage of the logistic model for binary classification is that it does not model deterministic conditional probabilities (that is, p(y = 1 | x) = 0, 1) very well. This problem can be remedied using the support vector machine formulation, which has an attractive, intuitive geometric interpretation for linearly separable problems. Although in this section a statistical modeling approach is used to gain useful insights, we will focus on support vector machines in the rest of the paper.

3 Total support vector classification

Our formulation of support vector classification with uncertain input data is motivated by the total least squares regression method that can be derived from the statistical model (3). We thus call the proposed algorithm the total support vector classification (TSVC) algorithm. We assume that inputs are subject to additive noise, i.e., x'_i = x_i + Δx_i, where the noise Δx_i follows a certain distribution. Bounded and ellipsoidal uncertainties are often discussed in the TLS context [7], and the resulting approaches find many real-life applications.
Hence, instead of assuming Gaussian noise as in (3) and (4), we consider a simple bounded uncertainty model ||Δx_i|| ≤ δ_i with uniform priors. The bound δ_i has an effect similar to the standard deviation σ_i in the Gaussian noise model. However, under the bounded uncertainty model, the squared penalty term ||x_i − x'_i||^2 / (2σ_i^2) is replaced by a constraint ||Δx_i|| ≤ δ_i. Another reason to use the bounded uncertainty noise model is that the resulting formulation has a more intuitive geometric interpretation (see Section 4).

SVMs construct classifiers based on separating hyperplanes {x : w^T x + b = 0}. Hence the parameter θ in (3) and (4) is replaced by a weight vector w and a bias b. In the separable case, TSVC solves the following problem:

min_{w, b, Δx_i, i=1,...,ℓ}  (1/2)||w||^2
subject to  y_i(w^T(x_i + Δx_i) + b) ≥ 1,  ||Δx_i|| ≤ δ_i,  i = 1, ..., ℓ.   (5)

For non-separable problems, we follow the standard practice of introducing slack variables ξ_i, one for each data point. In the resulting formulation, we simply replace the squared loss in (3) or the logistic loss in (4) by the margin-based hinge loss ξ = max{0, 1 − y(w^T x + b)}, which is used in the standard SVC.

min_{w, b, ξ, Δx_i, i=1,...,ℓ}  C Σ_{i=1}^ℓ ξ_i + (1/2)||w||^2
subject to  y_i(w^T(x_i + Δx_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., ℓ,
            ||Δx_i|| ≤ δ_i,  i = 1, ..., ℓ.   (6)

Note that we introduced the standard Tikhonov regularization term (1/2)||w||^2, as usually employed in SVMs. The effect is similar to a Gaussian prior in (3) and (4) with the Bayesian MAP (maximum a posteriori) estimator.
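To make the objective of (6) concrete, the sketch below (our own illustration; the function names are ours) evaluates it for a candidate hyperplane (w, b) and candidate noise displacements Δx_i:

```python
import numpy as np

def tsvc_objective(w, b, X, y, dX, C):
    """Soft-margin TSVC objective (6): C * sum(xi_i) + 0.5 * ||w||^2,
    where xi_i = max(0, 1 - y_i * (w^T (x_i + dx_i) + b))."""
    margins = y * ((X + dX) @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)
    return C * xi.sum() + 0.5 * (w @ w)

def feasible(dX, delta):
    """Check the uncertainty constraints ||dx_i|| <= delta_i of (6)."""
    return bool(np.all(np.linalg.norm(dX, axis=1) <= delta + 1e-12))
```

For a fixed (w, b), the best choice of each Δx_i within its uncertainty ball can only lower the slack of point i; Section 4 derives its closed form.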
One can regard (6) as a regularized instance of (2) with a non-probabilistic SVM discriminative loss criterion.

Problems with corrupted inputs are more difficult than problems with no input uncertainty. Even if there is a large-margin separator for the original uncorrupted inputs, the observed noisy data may become non-separable. By modifying the noisy input data as in (6), we reconstruct an easier problem, for which we may find a good linear separator. Moreover, by modeling noise in the input data, TSVC becomes less sensitive to data points that are very uncertain, since we can find a choice of Δx_i such that x_i + Δx_i is far from the decision boundary and will not be a support vector. This is illustrated later in Figure 1 (right). TSVC thus constructs classifiers by focusing on the more trustworthy data that are less uncertain.

4 Geometric interpretation

Further investigation reveals an intuitive geometric interpretation for TSVC which allows users to easily grasp the fundamentals of this new formulation. We first derive the following fact: once the optimal ŵ is obtained, the optimal Δx̂_i can be represented in terms of ŵ. If w is fixed in problem (6), optimizing problem (6) is equivalent to minimizing Σ ξ_i over the Δx_i. The following lemma characterizes the solution.

Lemma 1. For any given hyperplane (w, b), the solution Δx̂_i of problem (6) is Δx̂_i = y_i δ_i w / ||w||, i = 1, ..., ℓ.

Proof. The noise vector Δx_i only affects ξ_i and has no impact on the other slack variables ξ_j, j ≠ i. The minimization of Σ ξ_i can therefore be decoupled into ℓ subproblems, each minimizing ξ_i = max{0, 1 − y_i(w^T(x_i + Δx_i) + b)} = max{0, 1 − y_i(w^T x_i + b) − y_i w^T Δx_i} over its corresponding Δx_i.
By the Cauchy-Schwarz inequality, we have |y_i w^T Δx_i| ≤ ||w|| · ||Δx_i||, with equality if and only if Δx_i = cw for some scalar c. Since ||Δx_i|| is bounded by δ_i, the optimal Δx̂_i = y_i δ_i w / ||w||, and the minimal ξ̂_i = max{0, 1 − y_i(w^T x_i + b) − δ_i ||w||}.

Define S_w(X) = {x_i + y_i δ_i w / ||w||, i = 1, ..., ℓ}. Then S_w(X) is a set of points obtained by shifting the original points labeled +1 along w and the points labeled −1 along −w, respectively, each to its individual uncertainty boundary. These shifted points are illustrated in Figure 1 (middle) as filled points.

Theorem 1. The optimal hyperplane (ŵ, b̂) obtained by TSVC (5) separates S_ŵ(X) with the maximal margin. The optimal hyperplane (ŵ, b̂) obtained by TSVC (6) separates S_ŵ(X) with the maximal soft margin.

Figure 1: The separating hyperplanes obtained (left) by standard SVC and (middle) by total SVC (6). The margin can be magnified by taking the uncertainties into account. Right: the TSVC solution is less sensitive to outliers with large noise.

Proof. 1. If there exists any w such that S_w(X) is linearly separable, we can solve problem (5) to obtain the largest separation margin. Let (ŵ, b̂, Δx̂_i) be optimal to problem (5). Note that solving problem (5) is equivalent to maximizing ρ subject to the constraints y_i(w^T(x_i + Δx_i) + b) ≥ ρ and ||w|| = 1, so the optimal ρ = 1/||ŵ|| [8]. To attain the greatest ρ, we want to maximize y_i(ŵ^T(x_i + Δx_i) + b̂) over Δx_i for all i. Hence, following arguments similar to Lemma 1, we have |y_i ŵ^T Δx_i| ≤ ||ŵ|| ||Δx_i|| = δ_i ||ŵ||, and when Δx̂_i = y_i δ_i ŵ / ||ŵ||, equality holds.

2.
If no w exists to separate S_w(X), or even when such a w exists, we may solve problem (6) to achieve the best compromise between the training error and the margin size. Let ŵ be optimal to problem (6). By Lemma 1, the optimal Δx̂_i = y_i δ_i ŵ / ||ŵ||.

According to the above analysis, we can convert problems (5) and (6) to problems in the variables w, b, ξ alone, as opposed to optimizing over both (w, b, ξ) and Δx_i, i = 1, ..., ℓ. For example, the linearly non-separable problem (6) becomes

min_{w, b, ξ}  C Σ_{i=1}^ℓ ξ_i + (1/2)||w||^2
subject to  y_i(w^T x_i + b) + δ_i ||w|| ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., ℓ.   (7)

Solving problem (7) yields an optimal solution to problem (6), and problem (7) can be interpreted as finding (w, b) that separates S_w(X) with the maximal soft margin. A similar argument holds for the linearly separable case.

5 Solving and kernelizing TSVC

The TSVC problem (6) can be recast as a second-order cone program (SOCP), as is usually done in TLS or robust LS methods [7, 4]. However, directly implementing this SOCP is computationally quite expensive. Moreover, the SOCP formulation involves a large number of redundant variables, so a typical SOCP solver will take much longer to reach an optimal solution. We propose a simple iterative approach based on the alternating optimization method [1].

Algorithm 1. Initialize Δx_i = 0, and repeat the following two steps until a termination criterion is met:
1. Fix Δx_i, i = 1, ..., ℓ, at the current values; solve problem (6) for w, b, and ξ.
2. Fix w, b at the current values; solve problem (6) for Δx_i, i = 1, ..., ℓ, and ξ.

The first step of Algorithm 1 solves no more than a standard SVM by treating x_i + Δx_i as the training examples.
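Algorithm 1 can be sketched end to end as follows. This is our own illustration, not the authors' implementation: step 1 is handled here by a crude primal subgradient descent as a stand-in for a proper SVM solver, and step 2 uses the closed form from Lemma 1.

```python
import numpy as np

def svm_primal_step(X, y, C, w, b, lr=0.01, iters=500):
    # Stand-in SVM solver: subgradient descent on C * sum(hinge) + 0.5 * ||w||^2.
    for _ in range(iters):
        active = y * (X @ w + b) < 1.0          # points with positive hinge loss
        gw = w - C * (y[active, None] * X[active]).sum(axis=0)
        gb = -C * y[active].sum()
        w = w - lr * gw
        b = b - lr * gb
    return w, b

def tsvc_alternating(X, y, delta, C=1.0, outer=10):
    # Algorithm 1: alternate an SVM fit on the shifted points with the
    # closed-form noise update dx_i = y_i * delta_i * w / ||w|| (Lemma 1).
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    dX = np.zeros_like(X)
    for _ in range(outer):
        w, b = svm_primal_step(X + dX, y, C, w, b)
        dX = (y * delta)[:, None] * w / np.linalg.norm(w)
    return w, b, dX
```

In practice the paper solves step 1 via the dual SVM formulation; the subgradient step above simply keeps the sketch self-contained.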
Similar to how SVMs are usually optimized, we can solve the dual SVM formulation [8] for ŵ, b̂. The second step of Algorithm 1 solves the problem discussed in Lemma 1, so no optimization solver is needed: the solution Δx_i of the second step has a closed form in terms of the fixed w.

5.1 TSVC with linear functions

When only linear functions are considered, an alternative to Algorithm 1 exists for solving problem (6). As analyzed in [5, 3], the Tikhonov regularization min C Σ ξ_i + (1/2)||w||^2 has an important equivalent formulation as min Σ ξ_i subject to ||w|| ≤ γ, where γ is a positive constant. It can be shown that if γ ≤ ||w*||, where w* is the solution to problem (6) with the term (1/2)||w||^2 removed, then the solution of the constrained problem is identical to the solution of the Tikhonov regularization problem for an appropriately chosen C. Furthermore, at optimality the constraint ||w|| ≤ γ is active, which means ||ŵ|| = γ. Hence the TSVC problem (7) can be converted to a simple SOCP with the constraint ||w|| ≤ γ, or, equivalently using ||w||^2 ≤ γ^2, to the following quadratically constrained quadratic program (QCQP):

min_{w, b, ξ}  Σ_{i=1}^ℓ ξ_i
subject to  y_i(w^T x_i + b) + γδ_i ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., ℓ;  ||w||^2 ≤ γ^2.   (8)

This QCQP produces exactly the same solution as problem (6) but is much easier to implement since it contains far fewer variables.
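A minimal way to approximate (8) without a QCQP solver is projected subgradient descent: take subgradient steps on the total slack and project w back onto the ball ||w|| ≤ γ. This is our own sketch of that generic technique, not the solver used in the paper (the experiments below use a commercial package):

```python
import numpy as np

def tsvc_qcqp_subgradient(X, y, delta, gamma, lr=0.01, iters=2000):
    # Projected subgradient for (8): minimize sum_i max(0, 1 - y_i(w^T x_i + b) - gamma*delta_i)
    # subject to ||w|| <= gamma.
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        active = y * (X @ w + b) + gamma * delta < 1.0   # constraints with positive slack
        gw = -(y[active, None] * X[active]).sum(axis=0)
        gb = -y[active].sum()
        w = w - lr * gw
        b = b - lr * gb
        nw = np.linalg.norm(w)
        if nw > gamma:
            w = w * (gamma / nw)   # project back onto the feasible ball
    return w, b
```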
By a duality analysis similar to that adopted in [3], problem (8) has a dual formulation in the dual variables α as follows:

min_α  γ √(Σ_{i,j=1}^ℓ α_i α_j y_i y_j x_i^T x_j) − Σ_{i=1}^ℓ (1 − γδ_i) α_i
subject to  Σ_{i=1}^ℓ α_i y_i = 0,  0 ≤ α_i ≤ 1,  i = 1, ..., ℓ.   (9)

5.2 TSVC with kernels

By using a kernel function k, the input vector x_i is mapped to Φ(x_i) in a usually high-dimensional feature space. The uncertainty in the input data introduces uncertainties for the images Φ(x_i) in the feature space. TSVC can be generalized to construct separating hyperplanes in the feature space using the images of the input vectors and the mapped uncertainties. One possible generalization of TSVC is to assume that the images are still subject to additive noise and that the uncertainty model in the feature space can be represented as ||ΔΦ(x_i)|| ≤ δ_i. Then, following an analysis similar to Sections 4 and 5.1, we obtain a problem identical to (8) except with x_i replaced by Φ(x_i) and Δx_i replaced by ΔΦ(x_i), which can be easily kernelized by solving its dual formulation (9) with the inner products x_i^T x_j replaced by k(x_i, x_j).

It is more realistic, however, that we are only able to estimate uncertainties in the input space as bounded spheres ||Δx_i|| ≤ δ_i. When such an uncertainty sphere is mapped to the feature space, the mapped uncertainty region may correspond to an irregular shape, which brings difficulties to the optimization of TSVC. We thus propose an approximation strategy for Algorithm 1 based on the first-order Taylor expansion of k. A kernel function k(x, z) takes two arguments x and z. When we fix one of the arguments, for example z, k can be viewed as a function of the other argument x.
The first-order Taylor expansion of k with respect to x is k(x_i + Δx, ·) = k(x_i, ·) + Δx^T k'(x_i, ·), where k'(x_i, ·) is the gradient of k with respect to x at the point x_i.

Solving the dual SVM formulation in step 1 of Algorithm 1 with Δx_j fixed at Δx̄_j yields a solution (w̄ = Σ_j y_j ᾱ_j Φ(x_j + Δx̄_j), b̄) and thus a predictor f(x) = Σ_j y_j ᾱ_j k(x, x_j + Δx̄_j) + b̄. In step 2, we set (w, b) to (w̄, b̄) and minimize Σ ξ_i over the Δx_i, which, as discussed in Lemma 1, amounts to minimizing each ξ_i = max{0, 1 − y_i(Σ_j y_j ᾱ_j k(x_i + Δx_i, x_j + Δx̄_j) + b)} over Δx_i. Applying the Taylor expansion yields

y_i(Σ_j y_j ᾱ_j k(x_i + Δx_i, x_j + Δx̄_j) + b)
= y_i(Σ_j y_j ᾱ_j k(x_i, x_j + Δx̄_j) + b) + y_i Δx_i^T Σ_j y_j ᾱ_j k'(x_i, x_j + Δx̄_j).

By the Cauchy-Schwarz inequality, the optimal Δx_i = y_i δ_i v_i / ||v_i||, where v_i = Σ_j y_j ᾱ_j k'(x_i, x_j + Δx̄_j). A closed-form approximate solution for the second step is thus acquired.

Table 1: Average test error percentages of TSVC and standard SVC algorithms on synthetic problems (left and middle) and digit classification problems (right).

       Synthetic linear target       Synthetic quadratic target    Digits
ℓ      20   30   50   100  150       20   30   50   100  150       100    500
SVC    8.9  7.8  5.5  2.9  2.1       9.9  7.5  6.7  3.2  2.8       24.35  18.91
TSVC   6.1  5.2  3.8  2.1  1.6       7.9  6.1  4.4  2.8  2.4       23.00  16.10

6 Experiments

Two sets of simulations were performed, one on synthetic datasets and one on NIST handwritten digits, to validate the proposed TSVC algorithm.
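The closed-form step-2 update derived in Section 5.2 can be sketched as follows; we use an RBF kernel purely for illustration (the experiments below use linear and quadratic kernels), and all function names are ours:

```python
import numpy as np

def rbf(x, z, s=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 s^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * s**2))

def rbf_grad_x(x, z, s=1.0):
    # gradient of k(x, z) with respect to its first argument x
    return rbf(x, z, s) * (z - x) / s**2

def step2_update(X, y, alpha, delta, Xbar, s=1.0):
    # dx_i = y_i * delta_i * v_i / ||v_i||, with v_i = sum_j y_j alpha_j k'(x_i, xbar_j),
    # the first-order Taylor approximation of the step-2 minimizer.
    dX = np.zeros_like(X)
    for i, x in enumerate(X):
        v = sum(y[j] * alpha[j] * rbf_grad_x(x, Xbar[j], s) for j in range(len(Xbar)))
        nv = np.linalg.norm(v)
        if nv > 0:
            dX[i] = y[i] * delta[i] * v / nv
    return dX
```

Here Xbar stands for the already-shifted training points x_j + Δx̄_j produced by step 1.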
We used the commercial optimization package ILOG CPLEX 9.0 to solve problems (8) and (9) and the standard SVC dual problem that is part of Algorithm 1.

In the experiments with synthetic data in 2 dimensions, we generated ℓ (= 20, 30, 50, 100, 150) training examples x_i from the uniform distribution on [−5, 5]^2. Two binary classification problems were created, with target separating functions x_1 − x_2 = 0 and x_1^2 + x_2^2 = 9, respectively. We used TSVC with linear functions for the first problem and TSVC with the quadratic kernel (x_i^T x_j)^2 for the second problem. The input vectors x_i were contaminated by Gaussian noise with mean [0, 0] and covariance matrix Σ = σ_i I, where σ_i was randomly chosen from [0.1, 0.8] and I denotes the 2 × 2 identity matrix. To produce an outlier effect, we randomly chose 0.1ℓ examples from the first 0.2ℓ examples after the examples were sorted in ascending order of their distances to the target boundary. For these 0.1ℓ examples, noise was generated using a larger σ randomly drawn from [0.5, 2]. Models obtained by the standard SVC and TSVC were tested on a test set of 10000 examples generated from the same distribution and target functions but without contamination. We performed 50 trials for each experimental setting. The misclassification error rates averaged over the 50 trials are reported in Table 1.
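The synthetic data generation described above can be sketched as follows (our re-implementation of the stated protocol; the function name and RNG handling are ours):

```python
import numpy as np

def make_synthetic(ell, rng, linear=True):
    # Uniform inputs on [-5, 5]^2, labels from the target function,
    # per-example noise levels sigma_i in [0.1, 0.8], and 0.1*ell
    # near-boundary outliers with a larger sigma drawn from [0.5, 2].
    X = rng.uniform(-5, 5, size=(ell, 2))
    if linear:
        y = np.where(X[:, 0] - X[:, 1] >= 0, 1, -1)
        dist = np.abs(X[:, 0] - X[:, 1]) / np.sqrt(2)
    else:
        y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 >= 9, 1, -1)
        dist = np.abs(np.hypot(X[:, 0], X[:, 1]) - 3)
    sigma = rng.uniform(0.1, 0.8, size=ell)
    near = np.argsort(dist)[: int(0.2 * ell)]
    outliers = rng.choice(near, size=int(0.1 * ell), replace=False)
    sigma[outliers] = rng.uniform(0.5, 2.0, size=outliers.size)
    Xnoisy = X + sigma[:, None] * rng.standard_normal((ell, 2))
    return Xnoisy, y, sigma
```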
TSVC performed better than SVC overall. Two representative modeling results for ℓ = 50 are also depicted in Figure 2.

Figure 2: Results obtained by TSVC (solid lines) and standard SVC (dashed lines) for the problem with (left) a linear target function and (right) a quadratic target function. The true target functions are illustrated using dash-dot lines.

The NIST database of handwritten digits does not originally contain any uncertainty information. We created uncertainties by image distortions. Different types of distortions can be present in real-life data; we simulated this by rotating images. We used ℓ (= 100, 500) digits from the beginning of the database for training and 2000 digits from the end of the database for testing. We discriminated between odd numbers and even numbers. The angle of rotation for each digit was randomly chosen from [−8°, 8°]. The uncertainty upper bounds δ_i can be regarded as tuning parameters; we simply set all δ_i = δ. The data was preprocessed in the following way: training examples were centered to have mean 0 and scaled to have standard deviation 1, and the test data was preprocessed using the mean and standard deviation of the training examples. We performed 50 trials with TSVC and SVC using the linear kernel, which means we need to solve problem (9). Results are reported in Table 1, and the tuned parameter δ was 1.38 for ℓ = 100 and 1.43 for ℓ = 500.
We conjecture that TSVC performance can be further improved if we obtain an estimate of each individual δ_i.

7 Discussions

We investigated a new learning model in which the observed input is corrupted with noise. Based on a probability modeling approach, we derived a general statistical formulation in which the unobserved input is modeled as a hidden mixture component. Under this framework, we were able to develop estimation methods that take input uncertainty into consideration. Motivated by this probability modeling approach, we proposed a new SVM classification formulation that handles input uncertainty. This formulation has an intuitive geometric interpretation. Moreover, we presented simple numerical algorithms which can be used to solve the resulting formulation efficiently. Two empirical examples, one artificial and one with real data, were used to illustrate that the new method is superior to the standard SVM for problems with noisy input data. A related approach, with a different focus, is presented in [2]. Our work attempts to recover the original classifier from the corrupted training data, and hence we evaluated performance on clean test data. In our statistical modeling framework, rigorously speaking, the input uncertainty of test data should be handled by a mixture model (or a voted classifier under the noisy input distribution). The formulation in [2] was designed to separate the training data under the worst input noise configuration, instead of the most likely configuration as in our case. Its purpose is to directly handle test input uncertainty with a single linear classifier under the worst possible error setting. The relationship and relative advantages of these different approaches require further investigation.

References

[1] J. Bezdek and R. Hathaway. Convergence of alternating optimization. Neural, Parallel Sci. Comput., 11:351–368, 2003.

[2] C. Bhattacharyya, K. S. Pannagadatta, and A. J. Smola.
A second order cone programming formulation for classifying missing data. In NIPS, Vol. 17, 2005.

[3] J. Bi and V. N. Vapnik. Learning with rigorous support vector machines. In M. Warmuth and B. Schölkopf, editors, Proceedings of the 16th Annual Conference on Learning Theory, pages 35–42, Menlo Park, CA, 2003. AAAI Press.

[4] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035–1064, 1997.

[5] G. H. Golub, P. C. Hansen, and D. P. O'Leary. Tikhonov regularization and total least squares. SIAM Journal on Numerical Analysis, 30:185–194, 1999.

[6] G. H. Golub and C. F. Van Loan. An analysis of the total least squares problem. SIAM Journal on Numerical Analysis, 17:883–893, 1980.

[7] S. Van Huffel and J. Vandewalle. The Total Least Squares Problem: Computational Aspects and Analysis. Frontiers in Applied Mathematics 9. SIAM Press, Philadelphia, PA, 1991.

[8] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., New York, 1998.", "award": [], "sourceid": 2743, "authors": [{"given_name": "Jinbo", "family_name": "Bi", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}]}