{"title": "Adapting Codes and Embeddings for Polychotomies", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 536, "abstract": null, "full_text": "Adapting Codes and Embeddings for Polychotomies\n\nGunnar R\u00a8atsch, Alexander J. Smola\nRSISE, CSL, Machine Learning Group\n\nThe Australian National University\n\nCanberra, 0200 ACT, Australia\n\n Gunnar.Raetsch, Alex.Smola\n\n@anu.edu.au\n\nSebastian Mika\nFraunhofer FIRST\n\nKekulestr. 7\n\n12489 Berlin, Germany\nmika@first.fhg.de\n\nAbstract\n\nIn this paper we consider formulations of multi-class problems based on\na generalized notion of a margin and using output coding. This includes,\nbut is not restricted to, standard multi-class SVM formulations. Differ-\nently from many previous approaches we learn the code as well as the\nembedding function. We illustrate how this can lead to a formulation\nthat allows for solving a wider range of problems with for instance many\nclasses or even \u201cmissing classes\u201d. To keep our optimization problems\ntractable we propose an algorithm capable of solving them using two-\nclass classi\ufb01ers, similar in spirit to Boosting.\n\n1 Introduction\nThe theory of pattern recognition is primarily concerned with the case of binary classi\ufb01ca-\ntion, i.e. of assigning examples to one of two categories, such that the expected number of\nmisassignments is minimal. Whilst this scenario is rather well understood, theoretically as\nwell as empirically, it is not directly applicable to many practically relevant scenarios, the\nmost prominent being the case of more than two possible outcomes.\nSeveral learning techniques naturally generalize to an arbitrary number of classes, such as\ndensity estimation, or logistic regression. 
However, when comparing the reported performance of these systems with the de-facto standard of using two-class techniques in combination with simple, fixed output codes to solve multi-class problems, they often lack in terms of performance, ease of optimization, and/or run-time behavior.

On the other hand, many methods have been proposed to apply binary classifiers to multi-class problems, such as Error Correcting Output Codes (ECOC) [6, 1], Pairwise Coupling [9], or simply reducing the problem of discriminating between the classes to a collection of "one vs. the rest" dichotomies. Unfortunately the optimality of such methods is not always clear (e.g., how to choose the code, how to combine predictions, scalability to many classes).

Finally, there are other problems similar to multi-class classification which cannot be solved satisfactorily by just combining simpler variants of other algorithms: multi-label problems, where each instance should be assigned to a subset of possible categories, and ranking problems, where each instance should be assigned a rank for all or a subset of possible outcomes. These problems can, in reverse order of their appearance, be understood as more and more refined variants of multi-variate regression, i.e.

two-class, multi-class, multi-label, ranking, multi-variate regression.

Whichever framework and algorithm one chooses, it is usually possible to make out a single scheme common to all of them: there is an encoding step in which the input data are embedded into some "code space", and in this space there is a code book which allows one to assign one or several labels or ranks, respectively, by measuring the similarity between mapped samples and the code book entries.
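This common encode/decode scheme can be sketched in a few lines; a minimal illustration in which the code book entries and the embedded point are made-up toy values, not quantities learned by any of the cited methods:

```python
# The common scheme: an embedding maps a sample into a "code space" R^L,
# and a code book assigns the label whose entry is most similar (here:
# closest in squared Euclidean distance) to the mapped sample.

def squared_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign_label(embedded, code_book):
    # code_book maps each label to its code word in R^L
    return min(code_book, key=lambda label: squared_dist(embedded, code_book[label]))

code_book = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (-1.0, 0.0)}
g_x = (0.9, 0.2)    # hypothetical embedding output for one sample
print(assign_label(g_x, code_book))  # -> A
```

The same skeleton covers multi-label and ranking variants by returning several nearby entries, or the entries ordered by similarity, instead of only the nearest one.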
However, most previous work either focuses on finding a good embedding given a fixed code, or on optimizing the code given a fixed embedding (cf. Section 2.3).

The aim of this work is (i) to propose a multi-class formulation which optimizes both the code and the embedding of the training sample into the code space, and (ii) to develop a general ranking technique which specializes to multi-class, multi-label and ranking problems and also allows solving more general ones. As an example of the latter, consider the following model problem: in chemistry one is interested in mapping sequences to structures. It is not yet known whether there is a one-to-one correspondence, and hence the problem is to find for each sequence the best matching structures. However, there are only, say, a thousand sequences about which the chemists have good knowledge. These are assigned, with a certain rank, to a subset of, say, a thousand different structures. One could try to cast this as a standard multi-class problem by assigning each training sequence to the structure ranked highest. But then there will be classes to which only very few or no sequences are assigned, and one can obviously hardly learn those using traditional techniques. The machine we propose is (at least in principle) able to solve problems like this by reflecting relations between classes in the way the code book is constructed, while at the same time finding an embedding of the data space into the code space that allows for a good discrimination.

The remainder of this paper is organized as follows: In Section 2 we introduce some basic notions of large margins, output coding and multi-class classification. Then we discuss the approaches of [4] and [21] and propose to learn the code book.
In Section 3 we propose a rather general idea for solving the resulting multi-class problems using two-class classifiers. Section 4 presents some preliminary experiments before we conclude.

2 Large Margin Multi-Class Classification

Denote by $X$ the sample space (not necessarily a metric space) and by $Y$ the space of possible labels or ranks (e.g. $Y = \{1, \dots, Q\}$ for multi-class problems, where $Q$ denotes the number of classes, or $Y = \mathbb{R}^Q$ for a ranking problem). Let $S$ be a training sample of size $m$, i.e. $S = \{(x_1, y_1), \dots, (x_m, y_m)\} \subset X \times Y$.

Output Coding. It is well known (see [6, 1] and references therein) that multi-class problems can be solved by decomposing a polychotomy into $L$ dichotomies and solving these separately using a two-class technique. This can be understood as assigning to each class $r$ a binary string $c(r) \in \{-1, +1\}^L$ of length $L$, which is called a code word. This results in a $Q \times L$ binary code matrix. Each of the $L$ columns of this matrix defines a partitioning of the classes into two subsets, forming binary problems for which a classifier is trained. Evaluation is done by computing the outputs of all $L$ learned functions, forming a new bit string, and then choosing the class $r$ such that some distance measure between this string and the corresponding row of the code matrix is minimal, usually the Hamming distance. Ties can be broken by uniformly selecting a winning class, by using prior information or, where possible, by using confidence outputs from the basic classifiers.¹

Since the codes for the classes must be mutually distinct, there are $\prod_{r=0}^{Q-1}(2^L - r)$ possible code matrices to choose from (for $2^L \ge Q$).

¹We could also use ternary codes, i.e. $c(r) \in \{-1, 0, +1\}^L$, allowing for "don't care" classes.
One possibility is to choose the codes to be error-correcting (ECOC) [6]. Here one uses a code book with, e.g., a large Hamming distance between the code words, such that one still gets the correct decision even if a few of the classifiers err. However, finding the code that minimizes the training error is NP-complete, even for fixed binary classifiers [4]. Furthermore, errors committed by the binary classifiers are not necessarily independent, which significantly reduces the effective number of wrong bits that one can handle [18, 19]. Nonetheless ECOC has proven useful, and algorithms for finding a good code (and partly also for finding the corresponding classifiers) have been proposed in e.g. [15, 7, 1, 19, 4]. Noticeably, most practical approaches suggest dropping the requirement of binary codes and instead propose to use continuous ones.

We now show how predictions with small (e.g. Hamming) distance to their appropriate code words can be related to a large margin classifier, beginning with binary classification.

2.1 Large Margins

Dichotomies. Here a large margin classifier is defined as a mapping $f: X \to \mathbb{R}$ with the property that $y_i f(x_i) \ge \rho$ for all $i = 1, \dots, m$, where $\rho > 0$ is some positive constant [20].
Since such a positive margin may not always be achievable, one typically maximizes a penalized version of the maximum margin, such as

$\min_{f \in \mathcal{F}} \ \sum_{i=1}^{m} \xi_i + C\, P[f] \quad \text{subject to} \quad y_i f(x_i) \ge 1 - \xi_i, \ \xi_i \ge 0,$

where $P[f]$ is a regularization term, $C$ is a regularization constant and $\mathcal{F}$ denotes the class of functions under consideration. Note that for $y_i = 1$ we could rewrite the condition $y_i f(x_i) \ge 1$ also as $(f(x_i) - (-1))^2 - (f(x_i) - 1)^2 \ge 4$ (and likewise for $y_i = -1$). In other words, we can express the margin as the difference between the distance of $f(x_i)$ from the wrong target and its distance from the correct target.

Polychotomies. While this insight by itself is not particularly useful, it paves the way for an extension of the notion of the margin to multi-class problems: denote by $d$ a distance measure and by $c(r) \in \mathbb{R}^L$, $r = 1, \dots, Q$ (where $L$ is the length of the code), target vectors corresponding to class $r$. Then we can define the margin of an observation $x$ and class $y$ with respect to $f: X \to \mathbb{R}^L$ as

$\rho(x, y) := \min_{r \ne y} \left[ d(c(r), f(x)) - d(c(y), f(x)) \right]. \qquad (1)$

This means that we measure the minimal relative difference in distance between $f(x)$, the correct target $c(y)$, and any other target $c(r)$. We obtain accordingly the following optimization problem:

$\min_{f \in \mathcal{F}} \ \sum_{i=1}^{m} \xi_i + C\, P[f] \qquad (2)$

$\text{subject to} \quad d(c(r), f(x_i)) - d(c(y_i), f(x_i)) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \quad \text{for all } i = 1, \dots, m \text{ and } r \ne y_i. \qquad (3)$
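The margin definition (1) transcribes directly into code; here the squared Euclidean distance plays the role of $d$, and the code words and embedded point are arbitrary toy values:

```python
# rho(x, y) = min_{r != y} [ d(c(r), f(x)) - d(c(y), f(x)) ]:
# positive iff f(x) lies closer to the correct code word c(y)
# than to every other code word.

def squared_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def multiclass_margin(f_x, y, code_book):
    d_correct = squared_dist(code_book[y], f_x)
    return min(squared_dist(code_book[r], f_x) - d_correct
               for r in code_book if r != y)

code_book = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (-1.0, -1.0)}  # toy codes
f_x = (0.8, 0.1)
print(multiclass_margin(f_x, 0, code_book))   # positive: correctly classified
print(multiclass_margin(f_x, 1, code_book))   # negative: margin constraint violated
```

The constraints in (3) simply require this quantity to be at least $1 - \xi_i$ for every training example.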
For the time being we chose $1$ as a reference margin; an adaptive means of choosing the reference margin can be implemented using the $\nu$-trick, which leads to an easier to control regularization parameter [16] (cf. [4]).

2.2 Distance Measures

Several choices of $d$ are possible. However, one can show that only $d(c, z) = \|c - z\|^2$ and related functions will lead to a convex constraint on $f$:

Lemma 1 (Difference of Distance Measures) Denote by $d: \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$ a symmetric distance measure. Then the only case where $d(c, z) - d(c', z)$ is convex in $z$ for all $c, c'$ occurs if $d(c, z) = \Psi(c) + \Phi(z) + c^\top H z$ for some functions $\Psi, \Phi$ and a symmetric matrix $H$ (the squared Euclidean distance corresponds to $\Psi(c) = \|c\|^2$, $\Phi(z) = \|z\|^2$ and $H = -2I$).

Proof. Convexity in $z$ of both $d(c, z) - d(c', z)$ and $d(c', z) - d(c, z)$ is only possible if the difference is linear in $z$; hence the joint terms in $c$ and $z$ must be of linear nature in $z$. Symmetry, on the other hand, implies that the joint term must be linear in $c$, too, which proves the claim.

Lemma 1 implies that any distance function other than the ones described above will lead to optimization problems with potentially many local minima, which is not desirable. However, for quadratic $d$ we will get a convex optimization problem (assuming a suitable $P[f]$), and then there are ways to efficiently solve (3). Finally, re-defining $f$ if necessary, it is sufficient to consider only $d(c, z) = \|c - z\|^2$. We obtain

$d(c(r), f(x)) - d(c(y), f(x)) = \|c(r)\|^2 - \|c(y)\|^2 - 2\langle c(r), f(x)\rangle + 2\langle c(y), f(x)\rangle. \qquad (4)$

Note that if the code words have the same length, the difference of the projections of $f$ onto the different code words determines the margin. We will indeed later consider a more convenient case, $d(c, z) = -\langle c, z\rangle$, which leads to linear constraints only and allows us to use standard optimization packages. However, there are no principal limitations on using the Euclidean distance.

If we choose $c(1), \dots, c(Q)$ to be an error-correcting code, such as those in [6, 1], one will often have $L < Q$. Hence, we use fewer dimensions than we have classes. This means that during optimization we are trying to find $Q$ functions $\langle c(r), f(\cdot)\rangle$, $r = 1, \dots, Q$, from an $L$-dimensional subspace. In other words, we choose the subspace and perform regularization by allowing only a smaller class of functions. By appropriately choosing the subspace one may encode prior knowledge about the problem.

2.3 Discussion and Relation to Previous Approaches

Note that for $c(r) = e_r$ (the $r$-th unit vector) the difference (4) equals $2(f_y(x) - f_r(x))$, and hence the problem of multi-class classification reverts to the problem of solving $Q$ binary classification problems of one vs. the remaining classes. Then our approach turns out to be very similar to the idea presented in [21] (except for some additional slack variables).

A different approach was taken in [4]. Here, the function $f$ is held fixed and the code $c$ is optimized. In their approach, the code is described as a vector in a kernel feature space, and one obtains in fact an optimization problem very similar to the one in [21] and (3) (again, the slack variables are defined slightly differently).

Another idea which is quite similar to ours was also presented at the conference [5].
The resulting optimization problem turns out to be convex, but with the drawback that one can either not fully optimize the code vectors or not guarantee that they are well separated. Since these approaches were motivated by different ideas (one optimizing the code, the other optimizing the embedding), this shows that the roles of the code $c$ and the embedding function $f$ are interchangeable if the function or the code, respectively, is fixed.

Our approach allows arbitrary codes, for which a function $f$ is learned. This is illustrated in Figure 1. The positions of the code words (= "class centers") determine the function $f$. The positions of the centers relative to each other may reflect relationships between the classes (e.g. classes "black" & "white" and "white" & "grey" are close).

Figure 1: Illustration of the embedding idea: The samples are mapped from the input space $X$ into the code space $Y$ via the embedding function $f$, such that samples from the same class are close to their respective code book vector (crosses on the right). The spatial organization of the code book vectors reflects the organization of the classes in the $X$ space.

2.4 Learning Code & Embedding

This leaves us with the question of how to determine a "good" code and a suitable $f$. As we can see from (4), for fixed $f$ the constraints are linear in $c$ and vice versa, yet we have non-convex constraints if both $f$ and $c$ are variable. Finding the global optimum is therefore computationally infeasible when optimizing $f$ and $c$ simultaneously (furthermore, note that any rotation applied to $c$ and $f$ leaves the margin invariant, which shows the presence of local minima due to equivalent codes).

Instead, we propose the following method: for a fixed code $c$, optimize over $f$; subsequently, for fixed $f$, optimize over $c$; possibly repeating the process. The second step follows [4], i.e. we learn the code for a fixed function. Both steps can separately be performed fairly efficiently (since the respective optimization problems are convex; cf. Lemma 1). This procedure is guaranteed to decrease the overall objective function at every step and converges to a local minimum. We now show how a code maximizing the margin can be found. To avoid a trivial solution (one could virtually increase the margin by rescaling all $c$ by some constant), we add $\sum_r \|c(r)\|^2$ to the objective function. It can be shown that one does not need an additional regularization constant in front of this term if the distance is linear in both arguments.
If one prefers sparse codes, one may use the $\ell_1$-norm instead. In summary, we obtain the following convex quadratic program for finding the codes, which can be solved using standard optimization techniques:

$\min_{c, \xi} \ \sum_{r=1}^{Q} \|c(r)\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (5)$

$\text{subject to} \quad \langle c(y_i) - c(r), f(x_i)\rangle \ge 1 - \xi_i, \ \ \xi_i \ge 0, \quad \text{for all } i = 1, \dots, m \text{ and } r \ne y_i.$

The technique for finding the embedding will be discussed in more detail in Section 3.

Initialization. To obtain a good initial code, we may either take recourse to readily available code tables [17] or use a random code, e.g. generated by drawing vectors uniformly distributed on the $L$-dimensional sphere. One can show that the probability that two such vectors (out of $Q$) have a distance smaller than some $\varepsilon > 0$ is small (proof given in the full paper). Hence, with high probability the random code vectors are well separated from each other.²

²However, also note that this is quite a bit worse than the best packing. This is due to the union-bound argument in the proof, which requires us to sum, over all $Q(Q-1)/2$ pairs, the probability that a pair has distance less than $\varepsilon$.

3 Column Generation for Finding the Embedding

There are several ways to set up and optimize the resulting optimization problem (3). For instance, in [21, 4] the class of functions is the set of hyperplanes in some kernel feature space, and the regularizer $P[f]$ is the sum of the $\ell_2$-norms of the hyperplane normal vectors. In this section we consider a different approach. Denote by $\mathcal{H}$ a class of scalar-valued basis functions, from which we construct a vector-valued hypothesis set $\tilde{\mathcal{H}}$ with particularly nice properties for our purposes: each $h \in \mathcal{H}$ is extended by multiplication with a vector $\beta \in \mathbb{R}^L$, i.e. $\tilde{\mathcal{H}} = \{\beta\, h(\cdot) : h \in \mathcal{H}, \|\beta\| = 1\}$. Let $f = \sum_j \alpha_j \tilde{h}_j$ with $\tilde{h}_j \in \tilde{\mathcal{H}}$ and $\alpha_j \ge 0$. We choose the regularizer $P[f]$ to be the $\ell_1$-norm of the expansion coefficients. We are interested in solving

$\min_{\alpha \ge 0,\, \xi \ge 0} \ \sum_{j} \alpha_j + C \sum_{i=1}^{m} \xi_i \qquad (6)$

$\text{subject to} \quad \Big\langle c(y_i) - c(r), \sum_{j} \alpha_j \tilde{h}_j(x_i) \Big\rangle \ge 1 - \xi_i \quad \text{for all } i = 1, \dots, m \text{ and } r \ne y_i.$

To derive a column generation method [12, 2] we need the dual optimization problem, or more specifically its constraints:

$\sum_{i=1}^{m} \sum_{r \ne y_i} \lambda_{i,r} \langle c(y_i) - c(r), \tilde{h}(x_i)\rangle \le 1 \quad \text{for all } \tilde{h} \in \tilde{\mathcal{H}}, \qquad (7)$

where the dual variables satisfy $\lambda_{i,r} \ge 0$ and $\sum_{r \ne y_i} \lambda_{i,r} \le C$ for all $i$.

The idea of column generation is to start with a restricted master problem, namely (6) without any of the variables $\alpha_j$ (i.e. $f \equiv 0$). One then solves the corresponding dual problem (7) and finds a hypothesis that corresponds to a violated constraint (and with it one primal variable). This hypothesis is included in the optimization problem, one re-solves, and finds the next violated constraint. If all constraints of the full problem are satisfied, one has reached optimality.

Since there are infinitely many functions in the set $\tilde{\mathcal{H}}$, we have an infinite number of constraints in the dual optimization problem. Using the described column generation technique one can, however, find the solution of this semi-infinite programming problem [13]. We have to identify the constraint in (7) which is maximally violated, i.e. one has to find a "partitioning" $\beta$ and a hypothesis $h$ with maximal

$\sum_{i=1}^{m} \sum_{r \ne y_i} \lambda_{i,r} \langle c(y_i) - c(r), \beta\rangle\, h(x_i). \qquad (8)$

Maximizing (8) with respect to $\beta$ is easy for a given $h$: under an $\ell_2$-norm constraint one chooses $\beta$ proportional to $\sum_i \sum_{r \ne y_i} \lambda_{i,r} (c(y_i) - c(r))\, h(x_i)$; under an $\ell_1$-norm constraint one chooses the maximizing (signed) unit vector. However, finding $h$ and $\beta$ simultaneously is a difficult problem if not all $h \in \mathcal{H}$ are known in advance (see also [15]). We propose to first test all previously used hypotheses to find the best $\beta$. As a second step one finds the hypothesis $h$ that maximizes (8) for this $\beta$. Only if one cannot find a hypothesis that violates a constraint does one employ the more sophisticated techniques suggested in [15]. If there is no hypothesis left that corresponds to a violated constraint, the dual optimization problem is optimal.

In this work we are mainly interested in the case where $\beta$ is a signed unit vector, since then the problem of finding $h$ simplifies greatly: we can use another learning algorithm that minimizes or approximately minimizes the training error of a weighted training set (rewriting (8) accordingly). This approach has indeed many similarities to Boosting. Following the ideas in [14] one can show that there is a close relationship between our technique using the trivial code and multi-class boosting algorithms as e.g. proposed in [15].
4 Extensions and Illustration

4.1 A First Experiment

In a preliminary set of experiments we use two benchmark data sets from the UCI benchmark repository: glass and iris. We used our column generation strategy as described in Section 3 in conjunction with the code optimization problem to solve the combined optimization problem, finding the code and the embedding. The algorithm has only one model parameter ($C$), which we selected by cross-validation on the training data. The test error is determined by averaging over five splits of training and test data. As base learning algorithm we chose decision trees (C4.5), which we only use as a two-class classifier in our column generation algorithm.

On the glass data set we obtained an error rate of …%. In [1] an error of …% was reported for SVMs using a polynomial kernel. We also computed the test error of multi-class decision trees and obtained …% error. Hence, our hybrid algorithm could relatively improve existing results by …%. On the iris data we could achieve an error rate of …% and could slightly improve the result of decision trees (…%). However, SVMs beat our result with …% error [1]. We conjecture that this is due to the properties of decision trees, which have problems generating smooth boundaries not aligned with the coordinate axes.

So far we could only show a proof of concept, and more experimental work is necessary. It is in particular interesting to find practical examples where a non-trivial choice of the code (via optimization) helps simplify the embedding and finally leads to additional improvements. Such problems often appear in Computer Vision, where there are strong relationships between classes. Preliminary results indicate that one can achieve considerable improvements when adapting codes and embeddings [3].

Figure 2: Toy example for learning missing classes. Shown are the decision boundary and the confidence for assigning a sample to the upper left class. The training set, however, did not contain samples from this class.
Instead, we used (9) with the information that each example, besides belonging to its own class with confidence two, also belongs to the other classes with confidence one iff its distance to the respective center is less than one.

4.2 Beyond Multi-Class

So far we only considered the case where there is exactly one class to which an example belongs. In a more general setting, as for example the problem mentioned in the introduction, there can be several classes, which possibly have a ranking. For each example $i$ we have a set $\mathcal{R}_i$, which contains all pairs of "relations" among its positive classes, and a set $\mathcal{S}_i$, which contains all pairs of positive and negative classes of the example. We obtain:

$\min_{c, f, \xi, \tilde{\xi}} \ \sum_{r=1}^{Q} \|c(r)\|^2 + C \sum_{i=1}^{m} \Big( \sum_{(r,s) \in \mathcal{R}_i} \xi_{i,(r,s)} + \sum_{(r,s) \in \mathcal{S}_i} \tilde{\xi}_{i,(r,s)} \Big) \qquad (9)$

$\text{subject to} \quad \langle c(r) - c(s), f(x_i)\rangle \ge -\xi_{i,(r,s)} \quad \text{for all } i = 1, \dots, m \text{ and } (r, s) \in \mathcal{R}_i,$

$\text{and} \quad \langle c(r) - c(s), f(x_i)\rangle \ge 1 - \tilde{\xi}_{i,(r,s)} \quad \text{for all } i = 1, \dots, m \text{ and } (r, s) \in \mathcal{S}_i,$

with $\xi, \tilde{\xi} \ge 0$ and $f = \sum_j \alpha_j \tilde{h}_j$ as in Section 3. In this formulation one tries to find a code $c$ and an embedding $f$ such that, for each example, the outputs with respect to all classes the example has a relation with reflect the order of these relations (i.e. the examples get ranked appropriately). Furthermore, the program tries to achieve a "large margin" between relevant and irrelevant classes for each sample.
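The two families of constraints in (9) can be checked mechanically. The sketch below uses the inner-product score, a made-up code book, and hand-picked relation sets R and S for a single example:

```python
# For one example x: R holds ordered pairs of positive classes (the first
# should score higher), S holds (positive, negative) pairs (a margin of 1
# is required between their scores).

def output(code_book, f_x, r):
    # score of class r: inner product between its code word and f(x)
    return sum(a * b for a, b in zip(code_book[r], f_x))

def violations(code_book, f_x, R, S):
    viols = []
    for r, s in R:      # ranking among relevant classes (any positive gap)
        if output(code_book, f_x, r) - output(code_book, f_x, s) <= 0:
            viols.append(("rank", r, s))
    for r, s in S:      # large margin between relevant and irrelevant
        if output(code_book, f_x, r) - output(code_book, f_x, s) < 1:
            viols.append(("margin", r, s))
    return viols

code_book = {"r1": (1.0, 0.5), "r2": (0.5, 0.0), "neg": (-1.0, -0.5)}
f_x = (0.9, 0.4)
R = [("r1", "r2")]                   # r1 is ranked above r2
S = [("r1", "neg"), ("r2", "neg")]   # both positives beat the negative class
print(violations(code_book, f_x, R, S))  # -> [] : all constraints satisfied
```

In the program (9), each entry such a check would flag corresponds to one active slack variable penalized in the objective.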
Similar\nformulations can be found in [8] (see also [11]).\nOptimization of (9) is analogous to the column generation approach discussed in Section 3.\nWe omit details due to constraints on space. A small toy example, again as a limited proof\nof concept, is given in Figure 2.\n\n\u0004)\u0006\t\b\n\b\t\b\u000b\u0006\u0003&\n\n\". A\"\u0007\u0014\n\n\u0010\u000f\n\n.\n\nmulti-dimensional regression.\n\nConnection to Ranking Techniques Ordinal regression through large margins [10] can\nbe seen as an extreme case of (9), where we have as many classes as observations, and each\nis to\n\n\" . This formulation can of course also be understood as a special case of\n\npair of observations has to satisfy a ranking relation \nbe preferred to \u0016\n5 Conclusion\nWe proposed an algorithm to simultaneously optimize output codes and the embedding of\nthe sample into the code book space building upon the notion of large margins. Further-\n\n\" , if \u0016\n\n\u001b\u0018\u0006\n\n\u0019\n\n\u0002\n\u0002\n%\n\u000f\n\u0006\n%\n0\n\u001b\n#\n%\n\u000f\n\u0006\n%\n0\n\u001d\n\n\u0001\n\u0001\n\u0002\n\f\n\f\nA\n\u0014\nB\n\u001d\n\n\u000b\n'\n\u001d\n\u0014\n\u001b\n*\n\f\n\"\n,\n\"\n\u0013\n\u000b\n\f\n\u000f\n\u001b\n\u0019\n\u0007\n\u001b\n\f\n\u001c\n?\n\u0018\n\u0007\n\u001c\n\u0013\n\u000b\n\f\nB\n\u0018\n\u0013\n\u000b\n\f\n\u0014\n%\n\u0018\n\u001b\n?\n0\n0\n\u0014\n&\n\u0014\n%\n\u000f\n\u001b\n(\n&\n\u0014\n%\n0\nF\n)\n6\n\u0018\n\u000e\n(\n?\n\u0018\n\u0007\n\u001c\n\u0012\n%\n\u000f\n\u0006\n%\n0\n\u001b\n\u001d\n\n\u0018\n\u0014\n&\n\u0014\n%\n\u000f\n\u001b\n(\n&\n\u0014\n%\n0\nF\n)\n6\n\u0018\n\u0004\n(\n?\nB\n\u0018\n\u0012\n%\n\u000f\n\u0006\n%\n0\n\u001b\n\u001d\n\u0003\n\u0018\n6\n\u0014\n\u0016\n\u001b\n\u0001\n\u0002\n\n*\n\"\n,\n\u0016\n\u0003\n\n\u0002\n\u0001\n\u0014\n\u0016\n\u0017\n\u001b\n(\n\n\u0014\n\u0016\n\"\n\b\n(\n\u0011\n\u0017\n\u0017\n\fmore, we have shown, that only quadratic and related distance measures in the code book\nspace will lead to 
convex constraints and hence convex optimization problems whenever either the code or the embedding is held fixed. This is desirable since at least for these sub-problems there exist fairly efficient techniques to solve them (of course the combined optimization problem of finding the code and the embedding is not convex and has local minima). We proposed a column generation technique for solving the embedding optimization problems. It allows the use of a two-class algorithm, of which there exist many efficient ones, and has connections to boosting. Finally, we proposed a technique along the same lines that should be favorable when dealing with many classes or even empty classes. Future work will concentrate on finding more efficient algorithms to solve the optimization problem and on more carefully evaluating their performance.

Acknowledgements   We thank B. Williamson and A. Torda for interesting discussions.

References

[1] E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.

[2] K.P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In P. Langley, editor, Proc. 17th ICML, pages 65–72, San Francisco, 2000. Morgan Kaufmann.

[3] B. Caputo and G. Rätsch. Adaptive codes for visual categories. November 2002. Unpublished manuscript. Partial results presented at NIPS'02.

[4] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In N. Cesa-Bianchi and S. Goldberg, editors, Proc. COLT, pages 35–46, San Francisco, 2000. Morgan Kaufmann.

[5] O. Dekel and Y. Singer. Multiclass learning by probabilistic embeddings. In NIPS, vol. 15. MIT Press, 2003.

[6] T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes.
Journal of Artificial Intelligence Research, 2:263–286, 1995.

[7] V. Guruswami and A. Sahai. Multiclass learning, boosting, and error-correcting codes. In Proc. of the Twelfth Annual Conference on Computational Learning Theory, pages 145–155, New York, USA, 1999. ACM Press.

[8] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification: A new approach to multiclass classification and ranking. In NIPS, vol. 15. MIT Press, 2003.

[9] T.J. Hastie and R.J. Tibshirani. Classification by pairwise coupling. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems, vol. 10. MIT Press, 1998.

[10] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA, 2000. MIT Press.

[11] R. Jin and Z. Ghahramani. Learning with multiple labels. In NIPS, vol. 15. MIT Press, 2003.

[12] S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill, New York, NY, 1996.

[13] G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):193–221, 2002. Special Issue on New Methods for Model Selection and Model Combination.

[14] G. Rätsch, M. Warmuth, S. Mika, T. Onoda, S. Lemm, and K.-R. Müller. Barrier boosting. In Proc. COLT, pages 170–179, San Francisco, 2000. Morgan Kaufmann.

[15] R.E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the 14th International Conference, pages 313–321, 1997.

[16] B. Schölkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.

[17] N. Sloane. Personal homepage.
http://www.research.att.com/~njas/.

[18] W. Utschick. Error-Correcting Classification Based on Neural Networks. Shaker, 1998.

[19] W. Utschick and W. Weichselberger. Stochastic organization of output codes in multiclass learning problems. Neural Computation, 13(5):1065–1102, 2001.

[20] V.N. Vapnik and A.Y. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[21] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, Egham, 1998.