{"title": "Sample Complexity for Learning Recurrent Perceptron Mappings", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 210, "abstract": null, "full_text": "Sample Complexity for Learning \nRecurrent Percept ron Mappings \n\nBhaskar Dasgupta \n\nEduardo D. Sontag \n\nDepartment of Computer Science \n\nUniversity of Waterloo \n\nWaterloo, Ontario N2L 3G 1 \n\nDepartment of Mathematics \n\nRutgers University \n\nNew Brunswick, NJ 08903 \n\nCANADA \n\nUSA \n\nbdasgupt~daisy.uwaterloo.ca \n\nsontag~control.rutgers.edu \n\nAbstract \n\nRecurrent perceptron classifiers generalize the classical perceptron \nmodel. They take into account those correlations and dependences \namong input coordinates which arise from linear digital filtering. \nThis paper provides tight bounds on sample complexity associated \nto the fitting of such models to experimental data. \n\n1 \n\nIntroduction \n\nOne of the most popular approaches to binary pattern classification, underlying \nmany statistical techniques, is based on perceptrons or linear discriminants; see \nfor instance the classical reference (Duda and Hart, 1973). In this context, one is \ninterested in classifying k-dimensional input patterns \n\nV=(Vl, . . . ,Vk) \n\ninto two disjoint classes A + and A -. A perceptron P which classifies vectors into \nA + and A -\nis characterized by a vector (of \"weights\") C E lR k, and operates as \nfollows. One forms the inner product \n\nC.V = CIVI + ... CkVk . \n\nIf this inner product is positive, v is classified into A +, otherwise into A - . \nIn signal processing and control applications, the size k of the input vectors v is \ntypically very large, so the number of samples needed in order to accurately \"learn\" \nan appropriate classifying perceptron is in principle very large. On the other hand, \nin such applications the classes A + and A-often can be separated by means of a \ndynamical system of fairly small dimensionality. 
The existence of such a dynamical system reflects the fact that the signals of interest exhibit context dependence and correlations, and this prior information can help in narrowing down the search for a classifier. Various dynamical system models for classification appear for instance when learning finite automata and languages (Giles et al., 1990) and in signal processing as a channel equalization problem (at least in the simplest 2-level case) when modeling linear channels transmitting digital data from a quantized source, e.g. (Baksho et al., 1991) and (Pulford et al., 1991).

Sample Complexity for Learning Recurrent Perceptron Mappings    205

When dealing with linear dynamical classifiers, the inner product c · v represents a convolution by a separating vector c that is the impulse response of a recursive digital filter of some order n ≪ k. Equivalently, one assumes that the data can be classified using a c that is n-recursive, meaning that there exist real numbers r_1, ..., r_n so that

c_j = Σ_{i=1}^{n} c_{j-i} r_i,  j = n+1, ..., k.

Seen in this context, the usual perceptrons are nothing more than the very special subclass of "finite impulse response" systems (all poles at zero); thus it is appropriate to call the more general class "recurrent" or "IIR (infinite impulse response)" perceptrons. Some authors, particularly Back and Tsoi (Back and Tsoi, 1991; Back and Tsoi, 1995), have introduced these ideas in the neural network literature. There is also related work in control theory dealing with such classifying, or more generally quantized-output, linear systems; see (Delchamps, 1989; Koplon and Sontag, 1993).

The problem that we consider in this paper is: if one assumes that there is an n-recursive vector c that serves to classify the data, and one knows n but not the particular vector, how many labeled samples v^(i) are needed so as to be able to reliably estimate c?
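As a concrete illustration of the recursion just defined, the following sketch (our own, not from the paper; the helper names are ours) extends n freely chosen weights to a length-k n-recursive sequence and applies the resulting perceptron rule, so that only n + n numbers determine a classifier on k-dimensional inputs.

```python
def extend_n_recursive(c_init, r, k):
    """Extend the first n weights to length k via the recursion
    c_j = sum_{i=1..n} c_{j-i} * r_i for j = n+1, ..., k (1-based)."""
    n = len(c_init)
    assert len(r) == n and k >= n
    c = list(c_init)
    for _ in range(k - n):
        # c[-i] is c_{j-i} when appending c_j
        c.append(sum(c[-i] * r[i - 1] for i in range(1, n + 1)))
    return c

def classify(c, v):
    """Perceptron rule: +1 if the inner product c . v is positive, else -1."""
    return 1 if sum(ci * vi for ci, vi in zip(c, v)) > 0 else -1

# A 2-recursive sequence of length 8 is determined by 2 + 2 = 4 numbers,
# even though the input patterns v live in dimension k = 8.
c = extend_n_recursive([1.0, 0.5], r=[0.9, -0.2], k=8)
label = classify(c, [1.0] * 8)
```

Here c is the impulse response of a second-order recursive filter; a plain (FIR) perceptron corresponds to taking all r_i = 0.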
More specifically, we want to be able to guarantee that any classifying vector consistent with the seen data will classify "correctly with high probability" the unseen data as well. This is done by computing the VC dimension of the related concept class and then applying well-known results from computational learning theory. Very roughly speaking, the main result is that the number of samples needed is proportional to the logarithm of the length k (as opposed to k itself, as would be the case if one did not take advantage of the recurrent structure). Another application of our results, again by appealing to the literature from computational learning theory, is to the case of "noisy" measurements or more generally data not exactly classifiable in this way; for example, our estimates show roughly that if one succeeds in classifying 95% of a data set of size proportional to log q, then with high confidence one is assured of a correspondingly small prediction error rate on future (unlabeled) samples.

Section 5 contains a result on polynomial-time learnability: for n constant, the class of concepts introduced here is PAC learnable. Generalizations to the learning of real-valued (as opposed to Boolean) functions are discussed in Section 6. For reasons of space we omit many proofs; the complete paper is available by electronic mail from the authors.

2 Definitions and Statements of Main Results

Given a set 𝒳 and a subset X of 𝒳, a dichotomy on X is a function

δ : X → {-1, 1}.

Assume given a class ℱ of functions 𝒳 → {-1, 1}, to be called the class of classifier functions. The subset X ⊆ 𝒳 is shattered by ℱ if each dichotomy on X is the restriction to X of some
φ ∈ ℱ. The VC dimension of ℱ, denoted vc(ℱ), is the supremum (possibly infinite) of the cardinalities of the subsets of 𝒳 shattered by ℱ. Pick any two integers n > 0 and q ≥ 0. A sequence

c = (c_1, ..., c_{n+q}) ∈ ℝ^{n+q}

is said to be n-recursive if there exist real numbers r_1, ..., r_n so that

c_{n+j} = Σ_{i=1}^{n} c_{n+j-i} r_i,  j = 1, ..., q.

(In particular, every sequence of length n is n-recursive, but the interesting cases are those in which q ≠ 0, and in fact q ≫ n.) Given such an n-recursive sequence c, we may consider its associated perceptron classifier. This is the map

φ_c : ℝ^{n+q} → {-1, 1} : (x_1, ..., x_{n+q}) ↦ sign(Σ_{i=1}^{n+q} c_i x_i),

where the sign function is understood to be defined by sign(z) = -1 if z ≤ 0 and sign(z) = 1 otherwise. (Changing the definition at zero to be +1 would not change the results to be presented in any way.) We now introduce, for each two fixed n, q as above, a class of functions:

ℱ_{n,q} := {φ_c | c ∈ ℝ^{n+q} is n-recursive}.

This is understood as a function class with respect to the input space 𝒳 = ℝ^{n+q}, and we are interested in estimating vc(ℱ_{n,q}).

Our main result will be as follows (all logs in base 2):

Theorem 1

max{n, n⌊log(⌊1 + (q+1)/n⌋)⌋} ≤ vc(ℱ_{n,q}) ≤ min{n + q, 18n + 4n log(q+1)}.

Note that, in particular, when q > max{2 + n^2, 32}, one has the tight estimates

(n/2) log q ≤ vc(ℱ_{n,q}) ≤ 8n log q.

The organization of the rest of the paper is as follows. In Section 3 we state an abstract result on VC dimension, which is then used in Section 4 to prove Theorem 1. Finally, Section 6 deals with bounds on the sample complexity needed for identification of linear dynamical systems, that is to say, the real-valued functions obtained when not taking "signs" when defining the maps φ_c.

3 An Abstract Result on VC Dimension

Assume that we are given two sets 𝒳 and Λ, to be called in this context the set of inputs and the set of parameter values respectively. Suppose that we are also given a function

F : Λ × 𝒳 → {-1, 1}.
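For concreteness, the two sides of the inequality in Theorem 1 are easy to evaluate numerically; the sketch below (our own illustration, with our own function names, following our reading of the displayed bounds) makes the n log q growth visible.

```python
from math import floor, log2

def vc_lower(n, q):
    # Theorem 1, left side: max{ n, n * floor(log2(floor(1 + (q+1)/n))) }
    return max(n, n * floor(log2(floor(1 + (q + 1) / n))))

def vc_upper(n, q):
    # Theorem 1, right side: min{ n + q, 18n + 4n*log2(q+1) }
    return min(n + q, 18 * n + 4 * n * log2(q + 1))

# For n = 2 and q = 100, the VC dimension is pinned between 10 and about 89;
# both sides grow like n*log(q) once q is large relative to n, in contrast
# with the n + q parameters of an unconstrained perceptron.
lo, hi = vc_lower(2, 100), vc_upper(2, 100)
```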
Associated to this data is the class of functions

ℱ := {F(λ, ·) : 𝒳 → {-1, 1} | λ ∈ Λ}

obtained by considering F as a function of the inputs alone, one such function for each possible parameter value λ. Note that, given the same data, one could dually study the class

ℱ* := {F(·, ξ) : Λ → {-1, 1} | ξ ∈ 𝒳}

which obtains by fixing the elements of 𝒳 and thinking of the parameters as inputs. It is well-known (and in any case a consequence of the more general result to be presented below) that vc(ℱ) ≥ ⌊log(vc(ℱ*))⌋, which provides a lower bound on vc(ℱ) in terms of the "dual VC dimension." A sharper estimate is possible when Λ can be written as a product of n sets

Λ = Λ_1 × Λ_2 × ... × Λ_n    (1)

and that is the topic which we develop next.

We assume from now on that a decomposition of the form in Equation (1) is given, and will define a variation of the dual VC dimension by asking that only certain dichotomies on Λ be obtained from ℱ*. We define these dichotomies only on "rectangular" subsets of Λ, that is, sets of the form

L = L_1 × ... × L_n ⊆ Λ

with each L_i ⊆ Λ_i a nonempty subset. Given any index 1 ≤ κ ≤ n, by a κ-axis dichotomy on such a subset L we mean any function δ : L → {-1, 1} which depends only on the κth coordinate, that is, there is some function φ : L_κ → {-1, 1} so that δ(λ_1, ..., λ_n) = φ(λ_κ) for all (λ_1, ..., λ_n) ∈ L; an axis dichotomy is a map that is a κ-axis dichotomy for some κ. A rectangular set L will be said to be axis-shattered if every axis dichotomy is the restriction to L of some function of the form F(·, ξ) : Λ → {-1, 1}, for some ξ ∈ 𝒳.

Theorem 2 If L = L_1 × ... × L_n ⊆ Λ can be axis-shattered and each set L_i has cardinality r_i, then vc(ℱ) ≥ ⌊log(r_1)⌋ + ... + ⌊log(r_n)⌋.
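When the sets L_i and the pool of test inputs ξ are finite, axis-shattering involves only finitely many dichotomies and can be checked by brute force. The following sketch (our own illustration; all names are hypothetical, not from the paper) enumerates every axis dichotomy on a rectangular set and tests whether each is realized by some F(·, ξ).

```python
from itertools import product

def is_axis_shattered(L_factors, F, inputs):
    """Check the hypothesis of Theorem 2 on L = L_1 x ... x L_n:
    every dichotomy on L that depends on a single coordinate must coincide,
    on L, with lam -> F(lam, xi) for some xi in the finite pool `inputs`."""
    L = list(product(*L_factors))
    realized = {tuple(F(lam, xi) for lam in L) for xi in inputs}
    for kappa, Lk in enumerate(L_factors):
        # enumerate all phi : L_kappa -> {-1, +1}
        for signs in product((-1, 1), repeat=len(Lk)):
            phi = dict(zip(Lk, signs))
            dichotomy = tuple(phi[lam[kappa]] for lam in L)
            if dichotomy not in realized:
                return False
    return True

# Toy example: Lambda = {0,1} x {0,1}; each "input" xi names a coordinate
# kappa and a sign table, and F simply reads the table off that coordinate.
F = lambda lam, xi: xi[1][lam[xi[0]]]
inputs = [(k, s) for k in (0, 1) for s in product((-1, 1), repeat=2)]
ok = is_axis_shattered([(0, 1), (0, 1)], F, inputs)
```

In the toy example each r_i = 2, so when the check succeeds Theorem 2 gives vc(ℱ) ≥ ⌊log 2⌋ + ⌊log 2⌋ = 2.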
(In the special case n = 1 one recovers the classical result vc(ℱ) ≥ ⌊log(vc(ℱ*))⌋.) The proof of Theorem 2 is omitted due to space limitations.

4 Proof of Main Result

We recall the following result; it was proved, using Milnor-Warren bounds on the number of connected components of semi-algebraic sets, by Goldberg and Jerrum:

Fact 4.1 (Goldberg and Jerrum, 1995) Assume given a function F : Λ × 𝒳 → {-1, 1} and the associated class of functions ℱ := {F(λ, ·) : 𝒳 → {-1, 1} | λ ∈ Λ}. Suppose that Λ = ℝ^k and 𝒳 = ℝ^n, and that the function F can be defined in terms of a Boolean formula involving at most s polynomial inequalities in k + n variables, each polynomial being of degree at most d. Then vc(ℱ) ≤ 2k log(8eds).

Using the above Fact and bounds for the standard "perceptron" model, it is not difficult to prove the following Lemma.

Lemma 4.2 vc(ℱ_{n,q}) ≤ min{n + q, 18n + 4n log(q+1)}

Next, we consider the lower bound of Theorem 1.

Lemma 4.3 vc(ℱ_{n,q}) ≥ max{n, n⌊log(⌊1 + (q+1)/n⌋)⌋}

Proof As ℱ_{n,q} contains the class of functions sign(Σ_{i=1}^{n} c_i x_i) with c ∈ ℝ^n (obtained by taking r_1 = ... = r_n = 0), the bound vc(ℱ_{n,q}) ≥ n follows from the classical perceptron case; the remainder of the argument is omitted for reasons of space.

5 Polynomial-Time Learnability

For each fixed n > 0, the consistency problem for ℱ_{n,q} can be solved in time polynomial in q and s in the unit cost model, and in time polynomial in q, s, and L in the logarithmic cost model.

Since vc(ℱ_{n,q}) = O(n + n log(q+1)), it follows from here that the class ℱ_{n,q} is learnable in time polynomial in q (and L in the logarithmic cost model). Due to space limitations, we must omit the proof; it is based on the application of recent results regarding computational complexity aspects of the first-order theory of real-closed fields.

6 Pseudo-Dimension Bounds

In this section, we obtain results on the learnability of linear systems dynamics, that is, the class of functions obtained if one does not take the sign when defining recurrent perceptrons.
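In code terms (a minimal sketch of our own; the names are ours, not the paper's), dropping the sign simply exposes the raw linear response, and taking the sign recovers the Boolean classifier of Section 2.

```python
def response(c, v):
    """Real-valued linear system response: the raw inner product c . v,
    with no sign taken."""
    return sum(ci * vi for ci, vi in zip(c, v))

def sign_version(c, v):
    """Taking the sign (with sign(0) = -1, as in Section 2) recovers the
    recurrent perceptron classifier from the real-valued response."""
    return 1 if response(c, v) > 0 else -1
```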
The connection between VC dimension and sample complexity is only meaningful for classes of Boolean functions; in order to obtain learnability results applicable to real-valued functions one needs metric entropy estimates for certain spaces of functions. These can in turn be bounded through the estimation of Pollard's pseudo-dimension. We next briefly sketch the general framework for learning due to Haussler (based on previous work by Vapnik, Chervonenkis, and Pollard) and then compute a pseudo-dimension estimate for the class of interest.

The basic ingredients are two complete separable metric spaces 𝒳 and 𝒴 (called respectively the sets of inputs and outputs), a class ℱ of functions f : 𝒳 → 𝒴 (called the decision rule or hypothesis space), and a function ℓ : 𝒴 × 𝒴 → [0, r] ⊂ ℝ (called the loss or cost function). The function ℓ is such that the class of functions (x, y) ↦ ℓ(f(x), y) is "permissible" in the sense of Haussler and Pollard. Now, one may introduce, for each f ∈ ℱ, the function

A_{f,ℓ} : 𝒳 × 𝒴 × ℝ → {-1, 1} : (x, y, t) ↦ sign(ℓ(f(x), y) - t),

as well as the class 𝒜_{ℱ,ℓ} consisting of all such A_{f,ℓ}. The pseudo-dimension of ℱ with respect to the loss function ℓ, denoted by PD[ℱ, ℓ], is defined as:

PD[ℱ, ℓ] := vc(𝒜_{ℱ,ℓ}).

Due to space limitations, the relationship between the pseudo-dimension and the sample complexity of the class ℱ will not be discussed here; the reader is referred to the references (Haussler, 1992; Maass, 1994) for details.

For our application we define, for any two nonnegative integers n, q, the class

ℱ'_{n,q} := {φ