{"title": "Kernel Machines and Boolean Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 439, "page_last": 446, "abstract": null, "full_text": "Kernel Machines and Boolean Functions\n\nAdam Kowalczyk\n\nTelstra Research Laboratories\nTelstra, Clayton, VIC 3168\n\na.kowalczyk@trl.oz.au\n\nAlex J. Smola, Robert C. Williamson\n\nRSISE, MLG and TelEng\nANU, Canberra, ACT, 0200\n\n Alex.Smola, Bob.Williamson\n\n@anu.edu.au\n\nAbstract\n\nWe give results about the learnability and required complexity of logical\nformulae to solve classi\ufb01cation problems. These results are obtained by\nlinking propositional logic with kernel machines. In particular we show\nthat decision trees and disjunctive normal forms (DNF) can be repre-\nsented by the help of a special kernel, linking regularized risk to separa-\ntion margin. Subsequently we derive a number of lower bounds on the\nrequired complexity of logic formulae using properties of algorithms for\ngeneration of linear estimators, such as perceptron and maximal percep-\ntron learning.\n\n1 Introduction\n\nThe question of how many Boolean primitives are needed to learn a logical formula is\ntypically an NP-hard problem, especially when learning from noisy data. Likewise, when\ndealing with decision trees, the question what depth and complexity of a tree is required to\nlearn a certain mapping has proven to be a dif\ufb01cult task.\n\nWe address this issue in the present paper and give lower bounds on the number of Boolean\nfunctions required to learn a mapping. This is achieved by a constructive algorithm which\ncan be carried out in polynomial time. Our tools for this purpose are a Support Vector\nlearning algorithm and a special polynomial kernel.\n\nIn Section 2 we de\ufb01ne the classes of functions to be studied. We show that we can treat\npropositional logic and decision trees within the same framework. Furthermore we will\nargue that in the limit boosted decision trees correspond to polynomial classi\ufb01ers built\ndirectly on the data. Section 3 contains our main result linking the margin of separation\nto a simple complexity measure on the class of logical formulae (number of terms and\ndepth). Subsequently we apply this connection to devise test procedures concerning the\ncomplexity of logical formulae capable of learning a certain dataset. More speci\ufb01cally,\nthis will involve the training of a perceptron to minimize the regularized risk functional.\nExperimental results and a discussion conclude the paper. Some proofs have been omitted\ndue to space constraints. They can be found in an extended version of this paper (available\nat http://www.kernel-machines.org).\n\n\u0001\n\f2 Polynomial Representation of Boolean Formulae\n\n\u0007+,\u0006\u000e-\n\ngiven by the expansion\n\nI\tJ\n\n\u0001\u0016\u0015\u0018\u0017\u001a\u0019\u001c\u001b\n\n.\n\nIn other words, we attempt to learn a binary\n\nand moreover \u001b.)\n\nfunction on Boolean variables. A few de\ufb01nitions will be useful.\n\n243\n\n5K%\nand we use a compact notation \u0002\n\n, with the usual convention of\n\n(1)\n\nWe use the standard assumptions of supervised learning: we have a training set\n. 
Based on these observations we attempt to find a function f: X → R which incorporates the information given by the training set. What makes the situation in this paper special is that we assume X ⊆ {0,1}^n; in other words, we attempt to learn a binary function on Boolean variables. A few definitions will be useful.

The set of all polynomials f of degree ≤ d on {0,1}^n will be denoted^1 by P^n_d. They are given by the expansion

    f(x) = Σ_{A ∈ G} w_A x_A,   where   x_A := Π_{i ∈ A} x_i,                    (1)

with G a collection of index sets A ⊆ {1, ..., n} of size |A| ≤ d, coefficients w_A ∈ R, and the usual convention that the empty product equals 1.

The subset DNF_d ⊆ P^n_d of all polynomials of the form (2) will be called disjunctive normal forms. It is linked by the following lemma to the disjunctive normal forms commonly used in propositional logic; the latter consist of all clauses c: {0,1}^n → {0,1} which can be expressed as disjunctions of terms, each term being a conjunction of up to d logical primitives x_i and 1 − x_i.

Lemma 1 Assume that for each i = 1, ..., n there exists an i' ∈ {1, ..., n} such that

    x_{i'} = 1 − x_i   for all x = (x_1, ..., x_n) ∈ O ⊆ {0,1}^n.                 (3)

Then for every f ∈ DNF_d there exists a clause c of the above form such that, for every x ∈ O, f(x) ≥ 1 if and only if c(x) = 1; and vice versa, for every such clause c there exists f ∈ DNF_d satisfying the same relation.

Standing Assumption: Unless stated otherwise, we will in the rest of the paper assume that condition (3) of the above lemma holds. This is not a major restriction, since we can always satisfy (3) by artificially enlarging the domain O ⊆ {0,1}^n into {0,1}^{2n} via x → (x, 1 − x).

Now we consider another special subclass of polynomials, DT_d ⊆ P^n_d, called decision trees. These are polynomials which have expansions of type (1) in which (i) all coefficients satisfy w_A ∈ {−1, 1} and (ii) for every x ∈ {0,1}^n exactly one monomial "fires", i.e. exactly one of the numbers x_A, A ∈ G, equals 1 and all the others are 0.

Eq. (1) shows that each decision tree can be expressed as half of a difference between two disjunctive normal forms such that, for any given input, one and only one of the conjunctions comprising them is true. There is also an obvious link to the decision trees (on Boolean variables) popular in machine learning classification, cf. [4, 12]: the depth of a leaf equals the degree of the corresponding monomial, and the coefficient w_A corresponds to the class associated with the leaf.

^1 Such binary polynomials are widely used under the name of score tables; e.g. loan applications are typically assessed by financial institutions by evaluating such score tables.
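Purely as an illustration of the correspondence just described (the example, variable names and encoding below are ours, not the paper's), the following Python sketch writes a small decision tree on three Boolean variables as a polynomial with coefficients in {−1, +1}, one monomial per leaf, after enlarging each input with its complements as in the Standing Assumption.

# Illustrative sketch (not from the paper): a small decision tree on n = 3 Boolean
# variables written as a polynomial with {-1, +1} coefficients, one monomial per leaf.
# Following the Standing Assumption, each input x is enlarged to (x, 1 - x) so that
# negated literals become ordinary monomial factors.
from itertools import product

def enlarge(x):
    """Map x in {0,1}^n to (x, 1 - x) in {0,1}^(2n)."""
    return tuple(x) + tuple(1 - xi for xi in x)

# Tree: "if x1 then +1 elif x2 then -1 else +1"; one (coefficient, monomial) pair per leaf.
# Indices 0..2 address x1..x3, indices 3..5 address their complements 1-x1..1-x3.
leaves = [
    (+1, (0,)),      # leaf at depth 1:  x1
    (-1, (3, 1)),    # leaf at depth 2: (1-x1) * x2
    (+1, (3, 4)),    # leaf at depth 2: (1-x1) * (1-x2)
]

def poly(z, terms):
    """Evaluate a polynomial given as (coefficient, index tuple) monomials."""
    total = 0
    for w, A in terms:
        prod_A = 1
        for i in A:
            prod_A *= z[i]
        total += w * prod_A
    return total

def tree(x):
    """The same tree evaluated as a logical rule."""
    return +1 if x[0] == 1 else (-1 if x[1] == 1 else +1)

for x in product((0, 1), repeat=3):
    z = enlarge(x)
    fired = [A for _, A in leaves if all(z[i] == 1 for i in A)]
    assert len(fired) == 1             # exactly one leaf "fires" for every input
    assert poly(z, leaves) == tree(x)  # polynomial value matches the tree's prediction
print("decision tree reproduced by a degree-2 polynomial with +/-1 coefficients")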
3 Reproducing Kernel Hilbert Space and Risk

Kernel The next step is to map the complexity measure applied to decision trees, such as depth or number of leaves, to a Reproducing Kernel Hilbert Space (RKHS), as used in Support Vector machines. This space is defined as H := P^n_d with the scalar product corresponding to the norm given by the quadratic form

    ‖f‖² := Σ_{A ∈ G} w_A² / β_{|A|}   for f ∈ P^n_d.                             (4)

Here the β_i > 0 are complexity weights for each degree i of the polynomials, and the w_A are the coefficients of expansion (1).

Lemma 2 (Existence of Kernel) The RKHS kernel k(x, x') realizing the dot product corresponding to the quadratic form (4) has the following efficient functional form:

    k(x, x') = Σ_{i=1}^{d} β_i C(⟨x, x'⟩, i),                                      (5)

where C(s, i) denotes the binomial coefficient "s choose i".

Proof The norm (4) is well defined for all f ∈ P^n_d, and the space P^n_d is complete. Furthermore it is easy to check that (4) defines a homogeneous quadratic form on P^n_d. Via the polarization identity we can reconstruct a bilinear form (dot product) from (4). This gives us the desired Hilbert space. From [1] we obtain that there exists a unique kernel k(x, x') corresponding to this dot product. The key observation for the derivation of its form (5) is that, given x, x' ∈ {0,1}^n, there are exactly C(⟨x, x'⟩, i) non-vanishing monomials of degree i of the form x_{j_1} ··· x_{j_i} with j_1 < ··· < j_i, where the j's range over the positions in which both x and x' have a 1.

Note that for the special case β_i = β^i with β > 0, (5) simply leads to a binomial expansion and we obtain

    k(x, x') = (1 + β)^{⟨x, x'⟩} − 1.                                              (6)

The larger β, the less severely we will penalize higher order polynomials, which provides us with an effective means of controlling the complexity of the estimates. Note that (6) is applicable whenever d ≥ ⟨x, x'⟩, which always holds for d = n.

Due to the choice of the coefficients in DNF_d and DT_d we obtain ‖f‖² = Σ_{A ∈ G} 1/β_{|A|} for f in either class; in particular, with all complexity weights β_i = 1 the squared norm of f equals the number of terms of a disjunctive normal form and the number of leaves of a decision tree, respectively.
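As a small self-contained check of the kernel as reconstructed in (5)-(6) (the helper names below are ours), the following sketch evaluates the weighted form and its closed form for geometric complexity weights and confirms they agree.

# Sketch of the Boolean kernel of Lemma 2 as reconstructed above: k(x, x') counts the
# monomials (conjunctions) of degree 1..d shared by x and x', weighted by beta_i.
from math import comb

def boolean_kernel(x, xp, d, beta):
    """k(x, x') = sum_{i=1}^{d} beta[i] * C(<x, x'>, i); beta is indexed from 1."""
    s = sum(a * b for a, b in zip(x, xp))      # <x, x'> = number of shared 1's
    return sum(beta[i] * comb(s, i) for i in range(1, d + 1))

def boolean_kernel_geometric(x, xp, b):
    """Closed form for beta_i = b**i and d = n: (1 + b)^<x, x'> - 1."""
    s = sum(a * c for a, c in zip(x, xp))
    return (1 + b) ** s - 1

x  = (1, 0, 1, 1, 0, 1)
xp = (1, 1, 1, 0, 0, 1)
n, b = len(x), 0.5
beta = [None] + [b ** i for i in range(1, n + 1)]   # geometric complexity weights
assert abs(boolean_kernel(x, xp, n, beta) - boolean_kernel_geometric(x, xp, b)) < 1e-12
print(boolean_kernel(x, xp, n, beta))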
Next we introduce regularized risk functionals. They follow the standard assumptions made in soft-margin SVMs and regularization networks. For our training set {(x_i, y_i)} of size m and a regularization constant λ > 0 we define two regularized risks, R_1(f, λ) and R(f, λ): both add a complexity penalty proportional to λ‖f‖² to an empirical loss term, the first using the squared loss of regularization networks and the second the soft-margin (hinge) loss of support vector machines. Throughout, Err(f) denotes the number of classification errors of f on the training set (8).

The first risk is typically used by regularization networks [8], the other by support vector machines [5]. Note that for all f ∈ P^n_d we have R_1(f, λ) ≥ R(f, λ). Furthermore, if f ∈ DNF_d, then |f(x_i)| ≥ 1 for every i, and hence both risks are bounded from below by Err(f) plus the complexity penalty; this is the chain of inequalities (7). Note that in (7) equalities hold throughout for f ∈ DT_d, and in such a case the risks are fully determined by the depths of the leaves of the decision tree and the number of classification errors. Furthermore, in the particular case of decision trees with all complexity weights β_i = 1, ‖f‖² equals the number of leaves of the decision tree, and the risks are exactly equal to the "cost complexity" employed to prune decision trees in the CART algorithm [4]. In other words, the basis of the pruning algorithm in CART is the minimisation of the regularised risk over the class of subtrees of the maximal tree, with the regularisation constant λ selected by a heuristic applied to a validation set.

Our reasoning in the following relies on the idea that if we can find some function f ∈ P^n_d which minimizes R(f, λ) or R_1(f, λ), then the minimizer of the risk functionals, when chosen from the more restrictive set DT_d or DNF_d, must have a risk at least as large as the one found by optimizing over all of P^n_d, since DT_d, DNF_d ⊂ P^n_d. This can then be translated into a lower bound on the complexity of f ∈ DT_d ∪ DNF_d.

4 Complexity Bounds

The last part missing to establish a polynomial-time device to lower-bound the required complexity of a logical formula is to present actual algorithms for minimizing R(f, λ) or R_1(f, λ). In this section we study two such methods, the kernel perceptron and the maximum margin perceptron, and establish bounds on execution time and regularized risk.
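To make the role of the two risk functionals concrete, here is a minimal sketch that evaluates them for a kernel expansion f = Σ_j α_j k(x_j, ·). It assumes the standard squared and hinge losses with penalty (λ/2)‖f‖²; the paper's exact normalization of the losses may differ, so this is a sketch under that assumption rather than the paper's definition.

# Sketch: evaluating the two regularized risks of Section 3 for a kernel expansion
# f(x) = sum_j alpha_j k(x_j, x). Standard squared and hinge losses are assumed here,
# with penalty (lam/2) ||f||^2; the paper's exact normalization may differ.
import numpy as np

def risks(K, y, alpha, lam):
    """K: m x m kernel matrix, y: labels in {-1,+1}, alpha: expansion coefficients."""
    f = K @ alpha                                   # f(x_i) for all training points
    penalty = 0.5 * lam * alpha @ K @ alpha         # (lam/2) ||f||^2 in the RKHS
    r1 = np.sum((1.0 - y * f) ** 2) + penalty       # regularization-network style risk
    r = np.sum(np.maximum(0.0, 1.0 - y * f)) + penalty   # soft-margin (SVM) style risk
    err = int(np.sum(y * f <= 0))                   # Err(f): number of training errors
    return r1, r, err

# Any f in DT_d or DNF_d can only do worse than the minimum of these risks over P^n_d,
# which is how the lower bounds of Section 4 are read off.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 6))
y = np.where(X[:, 0] == 1, 1.0, -1.0)
K = (1.5 ** (X @ X.T)) - 1.0                        # Boolean kernel (6) with beta = 0.5
alpha = np.linalg.solve(K + 1e-6 * np.eye(20), y)   # a crude least-squares fit
print(risks(K, y, alpha, lam=1.0))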
Kernel Perceptron Test The kλ-perceptron learning algorithm is a direct modification of the ordinary linear perceptron learning rule. In the particular case of λ = 0 it becomes the ordinary perceptron learning rule in the feature space H; cf. [7, 6] for details. For λ > 0 it implements the perceptron learning rule in the extended feature space of [13].

Algorithm 1 Regularized kernel perceptron (kλ-perceptron)
Given: a Mercer kernel k and a constant λ ≥ 0.
Initialize: t := 0 and α_1 = ··· = α_m := 0.
while an update is possible do
    find an index i whose example is misclassified in the extended feature space, i.e. y_i f_α(x_i) + λ α_i ≤ 0; then update α_i ← α_i + 1 and t ← t + 1.
end while

We introduce the special notation f_α := Σ_{j=1}^m α_j y_j k(x_j, ·) ∈ P^n_d and ‖f_α‖² = Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j).

A modification of the standard proof of convergence of the linear perceptron [11], combined with the extended feature space trick [13], gives the following result.

Theorem 3 Assume that the coefficients α ∈ R^m were generated after the t-th update of the kλ-perceptron, and let ρ denote the maximal margin of separation of the training data by polynomials from P^n_d (treated as elements of the RKHS). Then t ≤ (D/ρ)² for a constant D determined by the training data, and the run yields the lower bound (9) on the regularized risk R(f, λ), valid for every f ∈ P^n_d.
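A minimal sketch of Algorithm 1 follows; it assumes the extended-feature-space mistake condition y_i f_α(x_i) + λ α_i ≤ 0 and a unit increment per update, as reconstructed above, and the function names are ours rather than the paper's.

# Minimal sketch of Algorithm 1 (k,lambda-perceptron) under the stated assumptions.
import numpy as np

def kernel_perceptron(K, y, lam=0.0, max_updates=10_000):
    """K: m x m Gram matrix of a Mercer kernel, y: labels in {-1,+1} (numpy arrays)."""
    m = len(y)
    alpha = np.zeros(m)
    t = 0                                    # number of updates performed
    while t < max_updates:
        f = K @ (alpha * y)                  # f_alpha(x_i) = sum_j alpha_j y_j k(x_j, x_i)
        margins = y * f + lam * alpha        # extended-feature-space margins
        bad = np.flatnonzero(margins <= 0)
        if bad.size == 0:                    # no update possible: stop
            break
        alpha[bad[0]] += 1.0
        t += 1
    return alpha, t

# The number of updates t and the final expansion f_alpha feed the risk lower bound (9).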
Maximum Margin Perceptron Test Below we state formally the soft margin version of the maximal margin perceptron algorithm. This is a simplified (homogeneous) version of the algorithm introduced in [9].

Algorithm 2 Greedy Maximal Margin Perceptron (kλ-MMP)
Given: θ > 0, λ ≥ 0 and a Mercer kernel k.
Initialize the expansion coefficients α_i, i = 1, ..., m, and the associated cached quantities.
while the margin-based stopping criterion controlled by θ is not satisfied do
    for every i = 1, ..., m evaluate the candidate update of α_i; perform the single greedy update that most improves the (soft) margin of f_α.
end while

The proof of the following theorem uses the extended feature space [13].

Theorem 4 Given θ > 0 and λ ≥ 0 satisfying condition (10), assume that the vector α ∈ R^m was generated after the t-th iteration of the "while loop" of the kλ-MMP learning rule. Then the regularized risk R(f, λ) of every f ∈ P^n_d is bounded from below as in (11). If the algorithm halts after the t-th update, then the sharper bound (12) holds for every f ∈ P^n_d.
Note that condition (10) ensures the convergence of the algorithm in finite time. The above theorem for λ = 0 ensures that the solution generated by Algorithm 2 converges to the (hard) maximum margin classifier. Further, it can be shown that the bound (11) holds for every admissible coefficient vector α ∈ R^m, not only for those generated by the algorithm.

Bounds on classification error The task of finding a linear perceptron minimizing the number of classification errors on the training set is known to be NP-hard. On this basis it is reasonable to expect that finding a decision tree or disjunctive normal form of upper bounded complexity and minimizing the number of errors is also hard. In this section we provide a lower bound on the number of errors for such classifiers.

The following estimates on Err(f), the number of classification errors (8), can be derived from Theorems 3 and 4.

Theorem 5 Let λ ≥ 0 and f ∈ DT_d ∪ DNF_d. If the vector α ∈ R^m has been generated after the t-th iteration of the kλ-perceptron learning rule, then Err(f) is bounded from below as in (13). On the other hand, if α has been generated after the t-th iteration of the "while loop" of the kλ-MMP learning rule, then Err(f) is bounded from below as in (14) and (15). Additionally, the estimate (14) holds for every admissible α ∈ R^m with α_i ≥ 0 for each i. Note that the corresponding factor equals t in (13), while it is 1 in (14).

The following result is derived from some recent results of Ben-David and Simon [2] on efficient learning of perceptrons.

Theorem 6 Given ε > 0 and an integer d > 0, there exists an algorithm A_ε which runs in time polynomial in both the input dimension n and the number m of training samples and which, given the labelled training sample (x_1, y_1), ..., (x_m, y_m), outputs a polynomial g ∈ P^n_d whose number of training errors Err(g) exceeds Err(f) by at most a term controlled by ε, for every f ∈ DT_d ∪ DNF_d.

Following [2] we give an explicit formulation of the algorithm A_ε: for each subset of the training set of a fixed size (depending only on ε), find the maximal margin hyperplane for that subset, if one exists; using the standard quadratic programming approach this can be done in time polynomial in both m and n [3]. Next, define A_ε(x_1, y_1, ..., x_m, y_m) as the vector of the hyperplane with the lowest error rate on the whole training set, and let g ∈ P^n_d be the corresponding polynomial.
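The enumeration scheme behind A_ε can be sketched as follows, here using scikit-learn's SVC with a large C as an approximate hard-margin solver; the subset size prescribed by [2] (a function of ε only) is left as a parameter. In the paper's setting the hyperplane lives in the feature space of monomials of degree up to d, which could be emulated by passing SVC(kernel='precomputed') a Gram matrix built from the Boolean kernel (5) instead of the linear kernel used in this sketch.

# Sketch of the enumeration scheme behind A_eps described above: fit a maximum-margin
# separator on every small subset of the training set and keep the one with the fewest
# errors on the full set. The subset size is a free parameter of this sketch.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def enumerate_max_margin(X, y, subset_size, C=1e6):
    """X, y: numpy arrays; returns the subset-fitted classifier with fewest training errors."""
    best_clf, best_errors = None, len(y) + 1
    for idx in combinations(range(len(y)), subset_size):
        idx = list(idx)
        if len(set(y[idx])) < 2:            # need both classes to fit a separator
            continue
        clf = SVC(kernel="linear", C=C)     # large C approximates a hard-margin fit
        clf.fit(X[idx], y[idx])
        errors = int(np.sum(clf.predict(X) != y))
        if errors < best_errors:
            best_clf, best_errors = clf, errors
    return best_clf, best_errors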
5 Experimental Results and Discussion

We have used a standard machine learning benchmark of a noisy 7 bit LED display for the 10 digits, 0 through 9, originally introduced in [4]. We generated 500 examples for training and 5000 for an independent test, under the assumption of a 10% probability of each bit being reversed. The task was to discriminate between two classes, digits 0-4 and digits 5-9. Each "noisy digit" data vector (x_1, ..., x_7) was complemented by an additional 7 bit vector (1 − x_1, ..., 1 − x_7) to ensure that our Standing Assumption of Section 2 holds true.

For the sake of simplicity we used fixed complexity weights β_i = 1, i = 1, ..., d, and a fixed regularization constant λ, which for a decision tree f ∈ DT_d gives a simple formula for the risk in terms of the number of leaves and the number of training errors. Four different algorithms have been applied to this data: (i) decision trees, version C4.5 [12] (available from www.cse.unsw.edu.au/~quinlan/); (ii) the regularized kernel perceptron (Algorithm 1), with the generated coefficients rescaled after training (t denoting the number of updates to convergence); (iii) the greedy maximal margin classifier (Algorithm 2); and (iv) the mask perceptron [10], which for this data generates a polynomial f ∈ P^n_d using greedy search heuristics. Table 1 gives the experimental results for two settings d_1 < d_2 of the maximal polynomial degree d.

Table 1: Results for recognition of two groups of digits on the faulty LED display.

                   Risk (number of leaves / SVs / terms)    Error rate %: train / test
Algorithm          d = d_1           d = d_2                d = d_1        d = d_2
Decision tree      110 (4 leaves)    80 (17 leaves)         21.3 / 22.9    12.0 / 15.8
Kernel SVM         44.4 (413 SV)     40.8 (382 SV)          12.2 / 15.1    11.2 / 14.8
Kernel percep.     53.1 (294 SV)     54.9 (286 SV)          11.8 / 16.3    13.8 / 17.1
Mask percep.       53.2 (10 terms)   49.1 (26 terms)        12.8 / 15.7    11.8 / 15.6

The lower bounds on the risk from the maximal margin criterion (Eq. 11) are 44.3 and 40.7 for d_1 and d_2, respectively. Similarly, the lower bounds on the risk from the kernel perceptron criterion (Eq. 9) are 39.7 and 36.2, respectively. The risks of the SVM solutions approach these bounds, and for the kernel perceptron they are reasonably close. Comparison with the risks obtained for decision trees shows that our lower bounds are meaningful (for the "un-pruned" decision trees the risks were only slightly worse). The mask perceptron results show that simple (low number of terms) polynomial solutions with risks approaching our lower bounds can be found in practice.

The Bayes-optimal classifier can be evaluated on this data set, since we know explicitly the distribution from which the data is drawn. Its error rates are 11.2% and 13.8% on the training and test sets, respectively.
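For reference, the Bayes-optimal rule quoted above can be computed directly from the noise model. The sketch below assumes the standard seven-segment encodings of the ten digits (the exact encoding used in [4] may differ in ordering, which does not affect the error rate) and the 10% independent bit-flip noise described above.

# Sketch: the Bayes-optimal rule for the noisy LED task. Segment patterns are the
# assumed standard seven-segment encodings (a-g); classes are digits 0-4 vs. 5-9.
from itertools import product

SEGMENTS = {                       # assumed standard 7-segment patterns
    0: "1111110", 1: "0110000", 2: "1101101", 3: "1111001", 4: "0110011",
    5: "1011011", 6: "1011111", 7: "1110000", 8: "1111111", 9: "1111011",
}
P_FLIP = 0.1

def likelihood(x, digit):
    """P(observed bits x | displayed digit), each bit flipped independently with P_FLIP."""
    lik = 1.0
    for xi, ci in zip(x, SEGMENTS[digit]):
        lik *= P_FLIP if xi != int(ci) else 1.0 - P_FLIP
    return lik

def bayes_predict(x):
    """Return +1 for class {0,...,4}, -1 for class {5,...,9} (digits equiprobable)."""
    p_low = sum(likelihood(x, d) for d in range(0, 5))
    p_high = sum(likelihood(x, d) for d in range(5, 10))
    return 1 if p_low >= p_high else -1

# Exact Bayes error: average over all 2^7 observable patterns and all 10 digits.
err = 0.0
for digit in range(10):
    y = 1 if digit <= 4 else -1
    for x in product((0, 1), repeat=7):
        err += 0.1 * likelihood(x, digit) * (bayes_predict(x) != y)
print(f"Bayes error of the 0-4 vs 5-9 task: {err:.3f}")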
SVM solutions have error rates closest to the Bayesian classifier (the test error rate for d = d_2 exceeds that of the Bayes-optimal classifier by only 7%).

Boosted Decision Trees An obvious question to ask is what happens if we take a large enough linear combination of decision trees. This is the case, for instance, in boosting. We can show that P^n_d is spanned by DT_d. In a nutshell, the proof relies on a partition of the identity into terms built from the factors x_i and 1 − x_i; expanding this partition and solving the expansion for a given monomial x_A leaves a remainder which turns out to be a decision tree. This means that in the limit, boosting decision trees finds a maximum margin solution in P^n_d, a goal more directly achievable via a maximum margin perceptron on P^n_d.

Conclusion We have shown that kernel methods and their analytical tools are applicable well outside their traditional domain, namely in the area of propositional logic, which has traditionally been an area of discrete, combinatorial rather than continuous, analytical methods. The constructive lower bounds we proved offer a fresh approach to some seemingly intractable problems. For instance, such bounds can be used as points of reference for practical applications of inductive techniques such as decision trees.

The use of Boolean kernels introduced here allows a more insightful comparison of the performance of logic-based and analytical, linear machine learning algorithms.

This contributes to research in the theory of learning systems, as illustrated by the result on the existence of a polynomial time algorithm for estimating the minimal number of training errors for decision trees and disjunctive normal forms.

A potentially more practical link, to boosted decision trees and their convergence to maximum margin solutions, has to be investigated further. The current paper sets the foundations for such research.

Boolean kernels can potentially stimulate more accurate (kernel) support vector machines by providing a more intuitive construction of kernels. This is the subject of ongoing research.

Acknowledgments A.K. acknowledges permission of the Chief Technology Officer, Telstra, to publish this paper. A.S. was supported by a grant of the DFG Sm 62/1-1. Parts of this work were supported by the ARC and an R&D grant from Telstra. Thanks to P. Sember and H. Ferra for help in the preparation of this paper.

References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
[2] S. Ben-David and H. U. Simon. Efficient learning of linear perceptrons. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 189-195, Cambridge, MA, 2001. MIT Press.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth Int., Belmont, CA, 1984.
[5] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, 2000.
[7] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. In J. Shavlik, editor, Machine Learning: Proceedings of the Fifteenth International Conference, San Francisco, CA, 1998. Morgan Kaufmann.
[8] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
[9] A. Kowalczyk. Maximal margin perceptron. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61-100, Cambridge, MA, 2000. MIT Press.
[10] A. Kowalczyk and H. Ferrà. Developing higher-order networks with empirically selected units. IEEE Transactions on Neural Networks, 5:698-711, 1994.
[11] A. B. Novikoff. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12:615-622, 1962.
[12] J. R. Quinlan. Simplifying decision trees. Int. J. Man-Machine Studies, 27:221-234, 1987.
[13] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 349-358, Cambridge, MA, 2000. MIT Press.
", "award": [], "sourceid": 1958, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}