{"title": "Convex Learning with Invariances", "book": "Advances in Neural Information Processing Systems", "page_first": 1489, "page_last": 1496, "abstract": null, "full_text": "Convex Learning with Invariances\n\nChoon Hui Teo\n\nAustralian National University\n\nAmir Globerson\n\nCSAIL, MIT\n\nchoonhui.teo@anu.edu.au\n\ngamir@csail.mit.edu\n\nSam Roweis\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nAlexander J. Smola\n\nNICTA\n\nCanberra, Australia\n\nroweis@cs.toronto.edu\n\nalex.smola@gmail.com\n\nAbstract\n\nIncorporating invariances into a learning algorithm is a common problem in ma-\nchine learning. We provide a convex formulation which can deal with arbitrary\nloss functions and arbitrary losses. In addition, it is a drop-in replacement for most\noptimization algorithms for kernels, including solvers of the SVMStruct family.\nThe advantage of our setting is that it relies on column generation instead of mod-\nifying the underlying optimization problem directly.\n\n1 Introduction\n\nInvariances are one of the most powerful forms of prior knowledge in machine learning; they have a\nlong history [9, 1] and their application has been associated with some of the major success stories\nin pattern recognition. For instance, the insight that in vision tasks, one should be often be designing\ndetectors that are invariant with respect to translation, small degrees of rotation & scaling, and image\nintensity has led to best-in-class algorithms including tangent-distance [13], virtual support vectors\n[5] and others [6].\n\nIn recent years a number of authors have attempted to put learning with invariances on a solid math-\nematical footing. For instance, [3] discusses how to extract invariant features for estimation and\nlearning globally invariant estimators for a known class of invariance transforms (preferably arising\nfrom Lie groups). 
Another mathematically appealing formulation of the problem of learning with invariances casts it as a second order cone program [8]; unfortunately this is neither particularly efficient to implement (having worse than cubic scaling behavior) nor does it cover a wide range of invariances in an automatic fashion. A different approach has been to pursue “robust” estimation methods which, roughly speaking, aim to find estimators whose performance does not suffer significantly when the observed inputs are degraded in some way. Robust estimation has been applied to learning problems in the context of missing data [2] and to deal with specific types of data corruption at test time [7]. The former approach again leads to a second order cone program, limiting its applicability to very small datasets; the latter is also computationally demanding and is limited to only specific types of data corruption.

Our goal in this work is to develop a computationally scalable and broadly applicable approach to supervised learning with invariances which is easily adapted to new types of problems and can take advantage of existing optimization infrastructures. In this paper we propose a method which has what we believe are many appealing properties:

1. It formulates invariant learning as a convex problem and thus can be implemented directly using any existing convex solver, requiring minimal additional memory and inheriting the convergence properties/guarantees of the underlying implementation.

2. It can deal with arbitrary invariances, including gradual degradations, provided that the user provides a computational recipe to generate invariant equivalents efficiently from a given data vector.

3. 
It provides a unifying framework for a number of previous approaches, such as the method of Virtual Support Vectors [5], and is broadly applicable not just to binary classification but in fact to any structured estimation problem in the sense of [16].

2 Maximum Margin Loss with Invariances

We begin by describing a maximum margin formulation of supervised learning which naturally incorporates invariance transformations on the input objects. We assume that we are given input patterns x ∈ X from some space X and that we want to estimate outputs y ∈ Y. For instance, Y = {±1} corresponds to binary classification; Y = Aⁿ corresponds to sequence prediction over the alphabet A.¹ We denote our prediction by ȳ(x), which is obtained by maximizing our learned function f : X × Y → R, i.e. ȳ(x) := argmax_{y∈Y} f(x, y). For instance, if we are training a (generative or discriminative) probabilistic model with f(x, y) = log p(y|x), then our prediction is the maximum a-posteriori estimate of the target y given x. In many interesting cases ȳ(x) is obtained by solving a nontrivial discrete optimization problem, e.g. by means of dynamic programming. In kernel methods f(x, y) = ⟨φ(x, y), w⟩ for a suitable feature map φ and weight vector w. For the purpose of our analysis the precise form of f is immaterial, although our experiments focus on kernel machines, due to the availability of scalable optimizers for that class of estimators.

2.1 Invariance Transformations and Invariance Sensitive Cost

The crucial ingredient in formulating invariant learning is to capture the domain knowledge that there exists some class S of invariance transforms s which can act on the input x while leaving the target y essentially unchanged. We denote by {(s(x), y) : s ∈ S} the set of valid transformations of the pair (x, y). 
For instance, we might believe that a slight rotation (in pixel coordinates) of an input image in a pattern recognition problem does not change the image label. For text classification problems such as spam filtering, we may believe that certain editing operations (such as changes in capitalization or substitutions like Viagra → V1agra, V!agra) should not affect our decision function. Of course, most invariances only apply “locally”, i.e. in the neighborhood of the original input vector. For instance, rotating an image of the digit 6 too far might change its label to 9; applying both a substitution and an insertion can change Viagra → diagram. Furthermore, certain invariances may only hold for certain pairs of input and target. For example, we might believe that horizontal reflection is a valid invariance for images of digits in classes 0 and 8 but not for digits in class 2. The set {s(x) : s ∈ S} incorporates both the locality and applicability constraints. (We have introduced a slight abuse of notation since s may depend on y, but this should always be clear in context.)

To complete the setup, we adopt the standard assumption that the world or task imposes a cost function such that if the true target for an input x is y and our prediction is ȳ(x) we suffer a cost Δ(y, ȳ(x)).² For learning with invariances, we extend the definition of Δ to include the invariance function s(x), if any, which was applied to the input object: Δ(y, ȳ(s(x)), s). This allows the cost to depend on the transformation; for instance, we might suffer less cost for poor predictions when the input has undergone very extreme transformations. 
In an image labeling problem, for example, we might believe that a lighting/exposure invariance applies but we might want to charge small cost for extremely over-exposed or under-exposed images since they are almost impossible to label. Similarly, we might assert that scale invariance holds but give small cost to severely spatially down-sampled images since they contain very little information.

2.2 Max Margin Invariant Loss

Our approach to the invariant learning problem is very natural, yet allows us to make a surprising amount of analytical and algorithmic progress. A key quantity is the cost under the worst case transformation for each example, i.e. the transformation under which our predicted target suffers the maximal cost compared with the true target:

C(x, y, f) = sup_{s∈S} Δ(y, ȳ(s(x)), s)    (1)

The objective function (loss) that we advocate minimizing during learning is essentially a convex upper bound on this worst case cost which incorporates a notion of (scaled) margin:

l(x, y, f) := sup_{y'∈Y, s∈S} Γ(y, y')(f(s(x), y') − f(s(x), y)) + Δ(y, y', s)    (2)

This loss function finds the combination of invariance transformation and predicted target for which the sum of (scaled) “margin violation” plus the cost is maximized. 

¹For more nontrivial examples see, e.g. [16, 14] and the references therein.
²Normally Δ = 0 if ȳ(x) = y but this is not strictly necessary.
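For a finite set of transformations and labels, the loss in Eq. (2) and its maximizing pair can be computed by direct enumeration. The following Python sketch illustrates this; the function name and the callable interfaces for f, Γ and Δ are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def invariant_loss(x, y, f, transforms, labels, Gamma, Delta):
    """Evaluate the invariant max-margin loss of Eq. (2) by enumeration:

        l(x, y, f) = max over (s, y') of
            Gamma(y, y') * (f(s(x), y') - f(s(x), y)) + Delta(y, y', s)

    Also returns the maximizing pair (s*, y*), which is all that is
    needed to form a subgradient.  `transforms` is a finite list of
    callables s: x -> s(x); `labels` is the finite output set Y.
    """
    best, best_pair = -np.inf, None
    for s in transforms:
        sx = s(x)
        for y_prime in labels:
            val = Gamma(y, y_prime) * (f(sx, y_prime) - f(sx, y)) \
                  + Delta(y, y_prime, s)
            if val > best:
                best, best_pair = val, (s, y_prime)
    return best, best_pair
```

For a linear binary model f(x, y) = y⟨w, x⟩ with Γ ≡ 1 and 0/1 cost Δ, this reduces to a worst-case hinge-style loss over the transformed copies of x.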
The function Γ(y, y') is a non-negative margin scaling which allows different target/prediction pairs to impose different amounts of loss on the final objective function.³ The numerical scale of Γ also sets the regularization tradeoff between margin violations and the prediction cost Δ.

This loss function has two mathematically important properties which allow us to develop scalable and convergent algorithms as proposed above.

Lemma 1 The loss l(x, y, f) is convex in f for any choice of Γ, Δ and S.

Proof For fixed (y', s) the expression Γ(y, y')(f(s(x), y') − f(s(x), y)) + Δ(y, y', s) is linear in f, hence (weakly) convex. Taking the supremum over a set of convex functions yields a convex function.

This means that we can plug l into any convex solver, in particular whenever f belongs to a linear function class, as is the case with kernel methods. The primal (sub)gradient of l is easy to write:

∂_f l(x, y, f) = Γ(y, y*)(φ(s*(x), y*) − φ(s*(x), y))    (3)

where s*, y* are values of s, y for which the supremum in Eq. (2) is attained and φ is the evaluation functional of f, that is ⟨f, φ(x, y)⟩ = f(x, y). In kernel methods φ is commonly referred to as the feature map with associated kernel

k((x, y), (x', y')) = ⟨φ(x, y), φ(x', y')⟩.    (4)

Note that there is no need to define S formally. All we need is a computational recipe to obtain the worst case s ∈ S in terms of the scaled margin in Eq. (2). Nor is there any requirement for Δ(y, y', s) or (s(x), y) to have any particularly appealing mathematical form, such as the polynomial trajectory required by [8], or the ellipsoidal shape described by [2].

Lemma 2 The loss l(x, y, f) provides an upper bound on C(x, y, f) = sup_{s∈S} Δ(y, ȳ(s(x)), s).

Proof Denote by (s*, y*) the values for which the supremum of C(x, y, f) is attained. 
By construction f(s*(x), y*) ≥ f(s*(x), y). Plugging this inequality into Eq. (2) yields

l(x, y, f) ≥ Γ(y, y*)(f(s*(x), y*) − f(s*(x), y)) + Δ(y, y*, s*) ≥ Δ(y, y*, s*).

Here the first inequality follows by substituting (s*, y*) into the supremum. The second inequality follows from the fact that Γ ≥ 0 and that f(s*(x), y*) ≥ f(s*(x), y) by construction.

This is essentially a direct extension of [16]. The main modifications are the inclusion of a margin scale Γ and the use of an invariance transform s(x). In Section 4 we clarify how a number of existing methods for dealing with invariances can be viewed as special cases of Eq. (2).

In summary, Eq. (2) penalizes estimation errors not only for the observed pair (x, y) but also for patterns s(x) which are “near” x in terms of the invariance transform s. Recall, however, that the cost function Δ may assign quite a small cost to a transformation s which takes x very far away from the original. Furthermore, the transformation class is restricted only by the computational consideration that we can efficiently find the “worst case” transformation; S does not have to have a specific analytic form. Finally, there is no specific restriction on y, thus making the formalism applicable to any type of structured estimation.

³Such scaling has been shown to be extremely important and effective in many practical problems, especially in structured prediction tasks. For example, the key difference between the large margin settings of [14] and [16] is the incorporation of a sequence-length dependent margin scaling.

3 Learning Algorithms for Minimizing Invariant Loss

We now turn to the question of learning algorithms for our invariant loss function. We assume that we are given a training set of input patterns X = {x1, . . 
. , xm} and associated labels Y = {y1, . . . , ym}. We follow the common approach of minimizing, at training time, our average training loss plus a penalty for model complexity. In the context of kernel methods this can be viewed as a regularized empirical risk functional of the form

R[f] = (1/m) Σ_{i=1}^m l(xi, yi, f) + (λ/2)‖f‖²_H  where f(x, y) = ⟨φ(x, y), w⟩.    (5)

A direct extension of the derivation of [16] yields that the dual of (5) is given by

minimize_α  λm Σ_{i,j=1}^m Σ_{y,y'∈Y} Σ_{s,s'∈S} α_{iys} α_{jy's'} K_{iys,jy's'} − Σ_{i=1}^m Σ_{y∈Y} Σ_{s∈S} Δ(yi, y, s) α_{iys}    (6a)

subject to Σ_{y∈Y} Σ_{s∈S} α_{iys} = 1 for all i, and α_{iys} ≥ 0.    (6b)

Here the entries of the kernel matrix K are given by

K_{iys,jy's'} = Γ(yi, y) Γ(yj, y') ⟨φ(s(xi), y) − φ(s(xi), yi), φ(s'(xj), y') − φ(s'(xj), yj)⟩    (7)

This can be expanded into four kernel functions by using Eq. (4). Moreover, the connection between the dual coefficients α_{iys} and f is given by

f(x', y') = Σ_{i=1}^m Σ_{y∈Y} Σ_{s∈S} α_{iys} [k((s(xi), y), (x', y')) − k((s(xi), yi), (x', y'))].    (8)

There are many strategies for attempting to minimize this regularized loss, either in the primal formulation or the dual, using either batch or online algorithms. In fact, a number of previous heuristics for dealing with invariances can be viewed as heuristics for approximately minimizing an approximation to an invariant loss similar to l. For this reason we believe a discussion of optimization is valuable before introducing specific applications of the invariance loss.

Whenever there are unlimited combinations of valid transformations and targets (i.e. 
the domain S × Y is infinite), the optimization above is a semi-infinite program, hence exact minimization of R[f] or of its dual is essentially impossible. However, even in such cases it is possible to find approximate solutions efficiently by means of column generation. In the following we describe two algorithms exploiting this technique, which are valid for both infinite and finite programs: one based on a batch scenario, inspired by SVMStruct [16], and one based on an online setting, inspired by BMRM/Pegasos [15, 12].

3.1 A Variant of SVMStruct

The work of [16, 10] on SVMStruct-like optimization methods can be used directly to solve regularized risk minimization problems. The basic idea is to compute gradients of l(xi, yi, f), either one observation at a time or for the entire set of observations simultaneously, and to perform updates in the dual space. While bundle methods work directly with gradients, solvers of the SVMStruct type are commonly formulated in terms of column generation on individual observations. We give an instance of SVMStruct for invariances in Algorithm 1. The basic idea is that instead of checking the constraints arising from the loss functions only for y, we check them for (y, s), that is, an invariance in combination with a corresponding label which violates the margin most.

If we view the tuple (s, y) as a “label” it is straightforward to see that the convergence results of [16] apply. That is, this algorithm converges to ε precision in O(ε⁻²) time. In fact, one may show, by solving the difference equation in the convergence proof of [16], that the rate can be improved to O(ε⁻¹). 
We omit technical details here.

Algorithm 1 SVMStruct for Invariances
1: Input: data X, labels Y, sample size m, tolerance ε
2: Initialize Si = ∅ for all i, and w = 0.
3: repeat
4:   for i = 1 to m do
5:     f(x', y') = Σ_i Σ_{z=(s,y)∈Si} α_iz [k((s(xi), y), (x', y')) − k((s(xi), yi), (x', y'))]
6:     (s*, y*) = argmax_{s∈S, y∈Y} Γ(yi, y)[f(s(xi), y) − f(s(xi), yi)] + Δ(yi, y, s)
7:     ξi = max(0, max_{(s,y)∈Si} Γ(yi, y)[f(s(xi), y) − f(s(xi), yi)] + Δ(yi, y, s))
8:     if Γ(yi, y*)[f(s*(xi), y*) − f(s*(xi), yi)] + Δ(yi, y*, s*) > ξi + ε then
9:       Increase constraint set Si ← Si ∪ {(s*, y*)}
10:      Optimize (6) using only α_iz where z ∈ Si.
11:    end if
12:  end for
13: until S has not changed in this iteration

3.2 An Application of Pegasos

Recently, Shalev-Shwartz et al. [12] proposed an online algorithm for learning optimization problems of the type in Eq. (5). Algorithm 2 is an adaptation of their method to learning with our convex invariance loss. 
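The Pegasos-style update at the heart of this adaptation can be sketched for an explicit (linear) feature map, where f(x, y) = ⟨w, φ(x, y)⟩; in the kernel case the explicit vectors are replaced by kernel expansions. The function name, the callable interfaces, and passing the projection radius as a parameter B are our own illustrative choices.

```python
import numpy as np

def pegasos_invariance_step(w, x, y, t, lam, B, phi,
                            transforms, labels, Gamma, Delta):
    """One stochastic subgradient step in the style of Pegasos for the
    invariant loss, using an explicit feature map phi(x, y) so that
    f(x, y) = <w, phi(x, y)>.  B is the radius of the feasible ball
    (sqrt(2 R[0] / lambda) in the paper)."""
    # find the worst-case (transformation, label) pair under the current w
    best, s_star, y_star = -np.inf, None, None
    for s in transforms:
        sx = s(x)
        for y_bar in labels:
            val = Gamma(y, y_bar) * (w @ phi(sx, y_bar) - w @ phi(sx, y)) \
                  + Delta(y, y_bar, s)
            if val > best:
                best, s_star, y_star = val, s, y_bar
    # subgradient step with learning rate 1/(lam * t)
    sx = s_star(x)
    w = (1.0 - 1.0 / t) * w \
        + (Gamma(y, y_star) / (lam * t)) * (phi(sx, y) - phi(sx, y_star))
    # project back onto the ball of radius B if necessary
    norm = np.linalg.norm(w)
    if norm > B:
        w = (B / norm) * w
    return w
```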
In a nutshell, the algorithm performs stochastic gradient descent on the regularized version of the instantaneous loss, using a learning rate of 1/(λt) and projecting the current weight vector back to the feasible region ‖f‖ ≤ √(2R[0]/λ), should it exceed it.

Algorithm 2 Pegasos for Invariances
1: Input: data X, labels Y, sample size m, iterations T
2: Initialize f1 = 0
3: for t = 1 to T do
4:   Pick (x, y) := (x_{t mod m}, y_{t mod m})
5:   Compute constraint violator
       (s*, y*) := argmax_{s̄∈S, ȳ∈Y} Γ(y, ȳ)[f(s̄(x), ȳ) − f(s̄(x), y)] + Δ(y, ȳ, s̄)
6:   Update f_{t+1} = [1 − 1/t] f_t + (Γ(y, y*)/(λt)) [k((s*(x), y), (·,·)) − k((s*(x), y*), (·,·))]
7:   if ‖f_{t+1}‖ > √(2R[0]/λ) then
8:     Update f_{t+1} ← √(2R[0]/λ) · f_{t+1}/‖f_{t+1}‖
9:   end if
10: end for

We can apply the convergence result from [12] directly to Algorithm 2. In this context note that the gradient with respect to l is bounded by twice the norm of Γ(y, y*)[φ(s(x), y*) − φ(s(x), y)], due to Eq. (3). We assume that the latter is bounded by R. We can apply [12, Lemma 1] immediately:

Theorem 3 Denote by R_t[f] := l(x_{t mod m}, y_{t mod m}, f) + (λ/2)‖f‖² the instantaneous risk at step t. In this case Algorithm 2 satisfies the following bound:

(1/T) Σ_{t=1}^T R_t[(1/T) Σ_{t=1}^T f_t] ≤ (1/T) Σ_{t=1}^T R_t[f_t] ≤ min_{‖f‖ ≤ √(2R[0]/λ)} (1/T) Σ_{t=1}^T R_t[f] + R²(1 + log T)/(2λT).    (9)

In particular, if T is a multiple of m we obtain bounds for the regularized risk R[f].

4 Related work and specific invariances

While the previous sections gave a theoretical description of the loss, we now discuss a number of special cases which can be viewed as instances of the convex invariance loss function presented here.

Virtual Support Vectors (VSVs): The most straightforward approach to incorporating prior knowledge is to add “virtual” (data) points generated from the existing dataset. An extension of this approach is to generate virtual points only from the support vectors (SVs) obtained from training on the original dataset [5]. The advantage of this approach is that it results in far fewer SVs than training on all virtual points. However, it is not clear which objective it optimizes. Our loss based approach does optimize an objective, and generates the required support vectors in the process of the optimization.

Second Order Cone Programming for Missing and Uncertain Data: In [2], the authors consider the case where the invariance is in the form of ellipsoids around the original point. This is shown to correspond to a second order cone program (SOCP). Instead of solving an SOCP, we can solve an equivalent but unconstrained convex problem.

Semidefinite Programming for Invariances: Graepel and Herbrich [8] introduce a method for learning when the invariances are polynomial trajectories. They show that the problem is equivalent to a semidefinite program (SDP). Their formulation is again an instance of our general loss based approach. 
Since SDPs are typically hard to solve for large problems, it is likely that the optimization scheme we suggest will perform considerably faster than standard SDP solvers.

Robust Estimation: Globerson and Roweis [7] address the case where invariances correspond to deletion of a subset of the features (i.e., setting their values to zero). This results in a quadratic program (QP) with a variable for each data point and feature in the training set. Solving such a large QP (e.g., 10⁷ variables for the MNIST dataset) is not practical, and again the algorithm presented here can be much more efficient. In fact, in the next section we introduce a generalization of the invariance in [7] and show how it can be optimized efficiently.

5 Experiments

Knowledge about invariances can be useful in a wide array of applications such as image recognition and document processing. Here we study two specific cases: handwritten digit recognition on the MNIST data, and spam filtering on the ECML06 dataset. Both examples are standard multiclass classification tasks, where Δ(y, y', s) is taken to be the 0/1 loss. Also, we take the margin scale Γ(y, y') to be identically one. We used SVMStruct and BMRM as the solvers for the experiments.

5.1 Handwritten Digit Recognition

Humans can recognize handwritten digits even when they are altered in various ways. To test our invariant SVM (Invar-SVM) in this context, we used handwritten digits from the MNIST dataset [11] and modeled 20 invariance transformations: 1-pixel and 2-pixel shifts in 4 and 8 directions, rotations by ±10 degrees, scaling by ±0.15 units, and shearing along the vertical or horizontal axis by ±0.15 units. To test the effect of learning with these invariances we used small training samples of 10, 20, . . . , 50 samples per digit. In this setting invariances are particularly important since they can compensate for the insufficient training data. 
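The simplest of these transformations, the pixel shifts, can be generated with plain array operations. A Python sketch follows; the helper name, the zero-padding convention, and restricting to 1-pixel shifts in 4 directions are our own illustrative choices (rotations, scalings and shears would come from an image-processing library).

```python
import numpy as np

def one_pixel_shifts(img):
    """Generate 1-pixel shifts of a 2-D image in the 4 axis directions,
    zero-padding the vacated row/column.  This covers only a subset of the
    20 transformations used in the MNIST experiment."""
    shifts = []
    for axis, step in [(0, 1), (0, -1), (1, 1), (1, -1)]:
        shifted = np.roll(img, step, axis=axis)
        # zero out the wrapped-around border instead of letting it wrap
        if axis == 0:
            shifted[0 if step == 1 else -1, :] = 0
        else:
            shifted[:, 0 if step == 1 else -1] = 0
        shifts.append(shifted)
    return shifts
```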
We compared Invar-SVM to a related method where all possible transformations were applied in advance to each data point to create virtual samples. The virtual and original samples were used to train a multiclass SVM (VIR-SVM). Finally, we also trained a multiclass SVM that did not use any invariance information (STD-SVM). All of the aforementioned SVMs were trained using an RBF kernel with well-chosen hyperparameters. For evaluation we used the standard MNIST test set.

Results for the three methods are shown in Figure 1. It can be seen that Invar-SVM and VIR-SVM, which use invariances, significantly improve the recognition accuracy compared to STD-SVM. This comes at a certain cost of using more support vectors, but for Invar-SVM the number of support vectors is roughly half of that in VIR-SVM.

5.2 SPAM Filtering

The task of detecting spam emails is a challenging machine learning problem. One of the key difficulties with such data is that it can change over time as a result of attempts by spam authors to outwit spam filters [4]. In this context, the spam filter should be invariant to the ways in which spam authors change their style. One common mechanism of style alteration is the insertion of common words, and avoiding using specific keywords consistently over time. If documents are

Figure 1: Results for the MNIST handwritten digit recognition task, comparing SVM trained on original samples (STD-SVM), SVM trained on original and virtual samples (VIR-SVM), and our convex invariance-loss method (Invar-SVM). Left figure shows the classification error as a function of the number of original samples per digit used in training. 
Right figure shows the number of support vectors corresponding to the optimum of each method.

represented using a bag-of-words, these two strategies correspond to incrementing the counts for some words, or setting them to zero [7].

Here we consider a somewhat more general invariance class (FSCALE) where word counts may be scaled by a maximum factor of u (e.g., 1.5) and a minimum factor of l (e.g., 0.5), and the maximum number of words subject to such perturbation is limited to K. Note that by setting l = 0 and u = 1 we specialize it to the feature deletion case (FDROP) of [7].

The invariances we consider are thus defined by

s(x) = {x ∘ α : α ∈ [l, u]^d, l ≤ 1 ≤ u, #{i : αi ≠ 1} ≤ K},    (10)

where ∘ denotes the element-wise product, d is the number of features, and #{·} denotes the cardinality of the set. The set S is large, so exhaustive enumeration is intractable. However, the search for the optimal perturbation s* is a linear program and can be computed efficiently by Algorithm 3 in O(d log d) time.

We evaluated the performance of our invariance loss FSCALE and its special case FDROP, as well as the standard hinge loss, on the ECML'06 Discovery Challenge Task A dataset.⁴ This dataset consists of two subsets, namely an evaluation set (ecml06a-eval) and a tuning set (ecml06a-tune). ecml06a-eval has 4000/7500 training/testing emails with dimensionality 206908, and ecml06a-tune has 4000/2500 training/testing emails with dimensionality 169620. We selected the best parameters for each method on ecml06a-tune and used them for training on ecml06a-eval. Results and parameter sets are shown in Table 1. 
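The greedy search of Algorithm 3 for the worst-case FSCALE perturbation admits a compact transcription. A Python sketch for binary labels y ∈ {±1} with a linear score follows; the function name and the guard-free treatment of K are illustrative choices, not part of the paper.

```python
import numpy as np

def fscale_worst_case(x, y, w, K, l, u):
    """Greedy search in the spirit of Algorithm 3: scale at most K
    coordinates of x by u (> 1) or l (< 1) so as to maximally decrease
    the margin y * <w, x>.  Assumes binary y in {-1, +1}; returns a
    perturbed copy of x."""
    x = x.copy()
    B = y * w * x                  # per-coordinate margin contributions
    I = np.argsort(B)              # indices sorted so B[I] is ascending
    i, j = 0, len(x) - 1
    for _ in range(K):
        # inflating the most negative contribution vs. shrinking the most
        # positive one: pick whichever hurts the margin more
        if B[I[i]] * (1 - u) > B[I[j]] * (1 - l):
            x[I[i]] *= u
            i += 1
        else:
            x[I[j]] *= l
            j -= 1
    return x
```

With l = 0 and u = 1 this reduces to zeroing the K coordinates whose contributions to the margin are largest, i.e. the FDROP case.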
We also performed McNemar's tests and rejected the null hypothesis that there is no difference between hinge and FSCALE/FDROP, with p-value < 10⁻³².

Algorithm 3 FSCALE loss
1: Input: datum x, label y, weight vector w ∈ R^d, invariance-loss parameters (K, l, u)
2: Initialize i := 1, j := d
3: B := y · w ∘ x
4: I := IndexSort(B), such that B[I] is in ascending order
5: for k = 1 to K do
6:   if B[I[i]] · (1 − u) > B[I[j]] · (1 − l) then
7:     x[I[i]] := x[I[i]] · u and i := i + 1
8:   else
9:     x[I[j]] := x[I[j]] · l and j := j − 1
10:  end if
11: end for

⁴http://www.ecmlpkdd2006.org/challenge.html

Loss    | Average Accuracy % | Average AUC % | Parameters (λ, K, l, u)
Hinge   | 74.75              | 83.63         | (0.005, -, -, -)
FDROP   | 81.73              | 87.79         | (0.1, 14, 0, 1)
FSCALE  | 83.71              | 89.14         | (0.01, 10, 0.5, 8)

Table 1: SPAM filtering results on ecml06a-eval averaged over 3 testing subsets. λ is the regularization constant; (K, l, u) are parameters for the invariance-loss methods. The loss FSCALE and its special case FDROP statistically significantly outperform the standard hinge loss (Hinge).

6 Summary

We have presented a general approach for learning using knowledge about invariances. Our cost function is essentially a worst case margin loss, and thus its optimization only relies on finding the worst case invariance for a given data point and model. This approach allows us to solve invariance problems which previously required solving very large optimization problems (e.g. a QP in [7]). We thus expect it to extend the scope of learning with invariances both in terms of the invariances used and the efficiency of optimization.

Acknowledgements: We thank Carlos Guestrin and Bob Williamson for fruitful discussions. Part of the work was done when CHT was visiting NEC Labs America. 
NICTA is funded through the Australian Government's Backing Australia's Ability initiative, in part through the ARC. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

[1] Y. Abu-Mostafa. A method for learning from hints. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, NIPS 5, 1992.

[2] C. Bhattacharyya, K. S. Pannagadatta, and A. J. Smola. A second order cone programming formulation for classifying missing data. In L. K. Saul, Y. Weiss, and L. Bottou, editors, NIPS 17, 2005.

[3] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 89–116, Cambridge, MA, 1999. MIT Press.

[4] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In KDD, 2004.

[5] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002.

[6] M. Ferraro and T. M. Caelli. Lie transformation groups, integral transforms, and invariant pattern recognition. Spatial Vision, 8:33–44, 1994.

[7] A. Globerson and S. Roweis. Nightmare at test time: Robust learning by feature deletion. In ICML, 2006.

[8] T. Graepel and R. Herbrich. Invariant pattern recognition by semidefinite programming machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, NIPS 16, 2004.

[9] G. E. Hinton. Learning translation invariant recognition in massively parallel networks. In Proceedings Conference on Parallel Architectures and Languages Europe, pages 1–13. Springer, 1987.

[10] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.

[11] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. 
Comparison of learning algorithms for handwritten digit recognition. In F. Fogelman-Soulié and P. Gallinari, editors, ICANN, 1995.

[12] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.

[13] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, NIPS 5, 1993.

[14] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, NIPS 16, 2004.

[15] C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In KDD, 2007.

[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
", "award": [], "sourceid": 1047, "authors": [{"given_name": "Choon", "family_name": "Teo", "institution": null}, {"given_name": "Amir", "family_name": "Globerson", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}