{"title": "Remarks on Interpolation and Recognition Using Neural Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 939, "page_last": 945, "abstract": null, "full_text": "REMARKS ON INTERPOLATION AND \nRECOGNITION USING NEURAL NETS \n\nEduardo D. Sontag\u00b7 \nSYCON - Center for Systems and Control \nRutgers University \nNew Brunswick, N J 08903 \n\nAbstract \n\nWe consider different types of single-hidden-Iayer feedforward nets: with \nor without direct input to output connections, and using either thresh(cid:173)\nold or sigmoidal activation functions. The main results show that direct \nconnections in threshold nets double the recognition but not the interpo(cid:173)\nlation power, while using sigmoids rather than thresholds allows (at least) \ndoubling both. Various results are also given on VC dimension and other \nmeasures of recognition capabilities. \n\n1 \n\nINTRODUCTION \n\nIn this work we continue to develop the theme of comparing threshold and sigmoidal \nfeedforward nets. In (Sontag and Sussmann, 1989) we showed that the \"general(cid:173)\nized delta rule\" (backpropagation) can give rise to pathological behavior -namely, \nthe existence of spurious local minima even when no hidden neurons are used,(cid:173)\nin contrast to the situation that holds for threshold nets. On the other hand, in \n(Sontag and Sussmann, 1989) we remarked that provided that the right variant be \n'Used, separable sets do give rise to globally convergent back propagation, in com(cid:173)\nplete analogy to the classical perceptron learning theorem. These results and those \nobtained by other authors probably settle most general questions about the case of \nno hidden units, so the next step is to look at the case of single hidden layers. In \n(Sontag, 1989) we announced the fact that sigmoidal activations (at least) double \nrecognition power. 
Here we provide details, and we make several further remarks on this as well as on the topic of interpolation. \n\nE-mail: sontag@hilbert.rutgers.edu \n\nNets with one hidden layer are known to be in principle sufficient for arbitrary recognition tasks. This follows from the approximation theorems proved by various authors: (Funahashi, 1988), (Cybenko, 1989), and (Hornik et al., 1989). However, what is far less clear is how many neurons are needed for achieving a given recognition, interpolation, or approximation objective. This is of importance both in its practical aspects (having rough estimates of how many neurons will be needed is essential when applying backpropagation) and in evaluating generalization properties (larger nets tend to lead to poorer generalization). It is known and easy to prove (see for instance (Arai, 1989), (Chester, 1990)) that one can basically interpolate values at any n + 1 points using an n-neuron net, and in particular that any (n + 1)-point set can be dichotomized by such nets. Among other facts, we point out here that allowing direct input to output connections permits doubling the recognition power to 2n, and the same result is achieved if sigmoidal neurons are used but such direct connections are not allowed. Further, we remark that approximate interpolation of 2n - 1 points is also possible, provided that sigmoidal units be employed (but direct connections in threshold nets do not suffice). \n\nThe dimension of the input space (that is, the number of \"input units\") can influence the number of neurons needed, at least for dichotomy problems for suitably chosen sets. In particular, Baum had shown some time back (Baum, 1988) that the VC dimension of threshold nets with a fixed number of hidden units is at least proportional to this dimension. 
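The n + 1-point interpolation fact mentioned above has a transparent construction in dimension one: place one threshold between each pair of consecutive points and let each weight supply the jump in the target value. A minimal sketch (the helper names are hypothetical; `heaviside` is the threshold activation defined in Section 3):

```python
# Sketch: exact interpolation of n + 1 values on the line by an
# n-neuron threshold net f(u) = w0 + sum_i wi * H(u - ti).
def heaviside(x):
    # H(x) = 0 for x <= 0, 1 for x > 0.
    return 1.0 if x > 0 else 0.0

def fit_threshold_net(xs, ys):
    # xs assumed sorted and distinct; one threshold at the midpoint of
    # each consecutive pair, weight equal to the jump in target value.
    w0 = ys[0]
    taus = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    ws = [b - a for a, b in zip(ys, ys[1:])]
    return lambda u: w0 + sum(w * heaviside(u - t) for w, t in zip(ws, taus))

xs = [0.0, 1.0, 2.5, 4.0]      # n + 1 = 4 points
ys = [2.0, -1.0, 0.5, 3.0]     # arbitrary target values
f = fit_threshold_net(xs, ys)  # uses n = 3 hidden threshold units
```

The resulting net is piecewise constant and hits every prescribed value exactly, which also shows that any n + 1 points on the line can be dichotomized by an n-neuron threshold net.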
We give lower bounds, in dimension two, at least doubling the VC dimension if sigmoids or direct connections are allowed. \n\nLack of space precludes the inclusion of proofs; references to technical reports are given as appropriate. A full-length version of this paper is also available from the author. \n\n2 DICHOTOMIES \n\nThe first few definitions are standard. Let N be a positive integer. A dichotomy or two-coloring (S₋, S₊) on a set S ⊆ ℝ^N is a partition S = S₋ ∪ S₊ of S into two disjoint subsets. A function f : ℝ^N → ℝ will be said to implement this dichotomy if it holds that \n\nf(u) > 0 for u ∈ S₊ and f(u) < 0 for u ∈ S₋. \n\nLet F be a class of functions from ℝ^N to ℝ, assumed to be nontrivial, in the sense that for each point u ∈ ℝ^N there is some f₁ ∈ F so that f₁(u) > 0 and some f₂ ∈ F so that f₂(u) < 0. This class shatters the set S ⊆ ℝ^N if each dichotomy on S can be implemented by some f ∈ F. \n\nHere we consider, for any class of functions F as above, the following measures of classification power. First we introduce μ̄ and μ̲, dealing with \"best\" and \"worst\" cases respectively: μ̄(F) denotes the largest integer l ≥ 1 (possibly ∞) so that there is at least some set S of cardinality l in ℝ^N which can be shattered by F, while μ̲(F) is the largest integer l ≥ 1 (possibly ∞) so that every set of cardinality l can be shattered by F. Note that by definition, μ̲(F) ≤ μ̄(F) for every class F. \n\nIn particular, the definitions imply that no set of cardinality μ̄(F) + 1 can be shattered, and that there is at least some set of cardinality μ̲(F) + 1 which cannot be shattered. The integer μ̄ is usually called the Vapnik-Chervonenkis (VC) dimension of the class F (see for instance (Baum, 1988)), and appears in formalizations of learning in the distribution-free sense. 
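For small planar examples the shattering condition can be checked exhaustively. The sketch below (hypothetical helper names) tests the class of affine functions on ℝ² treated in the example that follows; it relies on the standard fact that a linearly separable dichotomy of points in general position admits a separator parallel to a line through two of the points, so it is enough to enumerate those candidate separators:

```python
from itertools import product

def candidates(pts, eps=1e-6):
    # Candidate affine separators f(x, y) = a*x + b*y + c: normals of
    # lines through pairs of points (plus axis directions), with offsets
    # passing just above or below each point.
    dirs = [(1.0, 0.0), (0.0, 1.0)]
    for (px, py), (qx, qy) in product(pts, pts):
        if (px, py) != (qx, qy):
            dirs.append((qy - py, px - qx))   # normal to segment pq
    for a, b in dirs:
        for x, y in pts:
            for s in (-eps, eps):
                yield a, b, -(a * x + b * y) + s

def shatters(pts):
    # Exhaustively test every two-coloring of pts against the candidates.
    for labels in product([-1, 1], repeat=len(pts)):
        if not any(all(l * (a * x + b * y + c) > 0
                       for (x, y), l in zip(pts, labels))
                   for a, b, c in candidates(pts)):
            return False
    return True

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]           # shattered
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # "XOR" fails
```

Running `shatters` on the two sets reproduces the affine example: three non-colinear points are shattered, while the four vertices of a square are not (the alternating coloring defeats every candidate line).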
A set may fail to be shattered by F because it is very special (see the example below with colinear points). In that sense, a more robust measure is useful: μ(F) is the largest integer l ≥ 1 (possibly ∞) for which the class of sets S that can be shattered by F is dense, in the sense that given every l-element set S = {s₁, ..., s_l} there are points s̃ᵢ arbitrarily close to the respective sᵢ's such that S̃ = {s̃₁, ..., s̃_l} can be shattered by F. Note that \n\nμ̲(F) ≤ μ(F) ≤ μ̄(F)  (1) \n\nfor all F. \n\nTo obtain an upper bound m for μ(F) one needs to exhibit an open class of sets of cardinality m + 1 none of which can be shattered. \n\nTake as an example the class F consisting of all affine functions f(x, y) = ax + by + c on ℝ². Since any three points can be shattered by an affine map provided that they are not colinear (just choose a line ax + by + c = 0 that separates any point which is colored different from the rest), it follows that 3 ≤ μ. On the other hand, no set of four points can ever be dichotomized, which implies that μ̄ ≤ 3 and therefore the conclusion μ = μ̄ = 3 for this class. (The negative statement can be verified by a case by case analysis: if the four points form the vertices of a 4-gon, color them in \"XOR\" fashion, alternate vertices of the same color; if 3 form a triangle and the remaining one is inside, color the extreme points differently from the remaining one; if all colinear then use an alternating coloring.) Finally, since there is some set of 3 points which cannot be dichotomized (any set of three colinear points is like this), but every set of two can, μ̲ = 2. \n\nWe shall say that F is robust if whenever S can be shattered by F also every small enough perturbation of S can be shattered. For a robust class and l = μ(F), every set in an open dense subset in the above topology, i.e. 
almost every set of l elements, can be shattered. \n\n3 NETS \n\nWe define a \"neural net\" as a function of a certain type, corresponding to the idea of feedforward interconnections, via additive links, of neurons each of which has a scalar response or activation function θ. \n\nDefinition 3.1 Let θ : ℝ → ℝ be any function. A function f : ℝ^N → ℝ is a single-hidden-layer neural net with k hidden neurons of type θ and N inputs, or just a (k, θ)-net, if there are real numbers w₀, w₁, ..., w_k, τ₁, ..., τ_k and vectors v₀, v₁, ..., v_k ∈ ℝ^N such that, for all u ∈ ℝ^N, \n\nf(u) = w₀ + v₀·u + Σ_{i=1}^{k} wᵢ θ(vᵢ·u - τᵢ)  (2) \n\nwhere the dot indicates inner product. A net with no direct i/o connections is one for which v₀ = 0. \n\nFor fixed θ, and under mild assumptions on θ, such neural nets can be used to approximate uniformly arbitrary continuous functions on compacts. In particular, they can be used to implement arbitrary dichotomies. \n\nIn neural net practice, one often takes θ to be the standard sigmoid σ(x) = 1/(1 + e^{-x}) or, equivalently, up to translations and change of coordinates, the hyperbolic tangent tanh(x). Another usual choice is the hardlimiter, threshold, or Heaviside function \n\nH(x) = 0 if x ≤ 0, and H(x) = 1 if x > 0, \n\nwhich can be approximated well by σ(γx) when the \"gain\" γ is large. Yet another possibility is the use of the piecewise linear function \n\nπ(x) = -1 if x ≤ -1, π(x) = 1 if x > 1, and π(x) = x otherwise. \n\nMost analysis has been done for H and no direct connections, but numerical techniques typically use the standard sigmoid (or equivalently tanh). The activation π will be useful as an example for which sharper bounds can be obtained. The examples σ and π, but not H, are particular cases of the following more general type of activation function: \n\nDefinition 3.2 A function θ : ℝ → ℝ 
will be called a sigmoid if these two properties hold: \n\n(S1) t₊ := lim_{x→+∞} θ(x) and t₋ := lim_{x→-∞} θ(x) exist, and t₊ ≠ t₋. \n\n(S2) There is some point c such that θ is differentiable at c and θ′(c) = μ ≠ 0. \n\nAll the examples above lead to robust classes, in the sense defined earlier. More precisely, assume that θ is continuous except for at most finitely many points x, and it is left continuous at such x, and let F be the class of (k, θ)-nets, for any fixed k. Then F is robust, and the same statement holds for nets with no direct connections. \n\n4 CLASSIFICATION RESULTS \n\nWe let μ(k, θ, N) denote μ(F), where F is the class of (k, θ)-nets in ℝ^N with no direct connections, and similarly for μ̲ and μ̄, and a superscript d is used for the class of arbitrary such nets (with possible direct connections from input to output). The lower measure μ̲ is independent of dimension: \n\nLemma 4.1 For each k, θ, N, μ̲(k, θ, N) = μ̲(k, θ, 1) and μ̲^d(k, θ, N) = μ̲^d(k, θ, 1). \n\nThis justifies denoting these quantities just as μ̲(k, θ) and μ̲^d(k, θ) respectively, as we do from now on, and giving proofs only for N = 1. \n\nLemma 4.2 For any sigmoid θ, and for each k, N, \n\nμ(k + 1, θ, N) ≥ μ^d(k, H, N) \n\nand similarly for μ̲ and μ̄. \n\nThe main results on classification will be as follows. \n\nTheorem 1 For any sigmoid θ, and for each k, \n\nμ̲(k, H) = k + 1 \nμ̲^d(k, H) = 2k + 2 \nμ̲(k, θ) ≥ 2k. \n\nTheorem 2 For each k, \n\n4⌊k/2⌋ ≤ μ(k, H, 2) ≤ 2k + 1 \nμ^d(k, H, 2) ≤ 4k + 3. \n\nTheorem 3 For any sigmoid θ, and for each k, \n\n2k + 1 ≤ μ̄(k, H, 2) \n4k + 3 ≤ μ̄^d(k, H, 2) \n4k - 1 ≤ μ̄(k, θ, 2). \n\nThese results are proved in (Sontag, 1990a). 
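In dimension one the worst-case value μ̲(k, H) = k + 1 of Theorem 1 can be made concrete by a counting argument: with no direct connections, a (k, H)-net on the line is piecewise constant with at most k jumps, so it realizes a coloring of ordered points exactly when the coloring changes sign at most k times. A sketch of this criterion (the helper names are hypothetical):

```python
from itertools import product

def alternations(labels):
    # Number of sign changes of the coloring along ordered points.
    return sum(a != b for a, b in zip(labels, labels[1:]))

def realizable(labels, k):
    # A (k, H)-net with no direct connections is piecewise constant
    # with at most k jumps on the line, so it can realize a coloring
    # of ordered points iff the coloring alternates at most k times.
    return alternations(labels) <= k

k = 3
# Every coloring of k + 1 ordered points alternates at most k times,
# so every such set is shattered.
ok = all(realizable(l, k) for l in product([-1, 1], repeat=k + 1))
# The strictly alternating coloring of k + 2 points needs k + 1 jumps,
# so no set of k + 2 colinear points is shattered.
bad = realizable([(-1) ** i for i in range(k + 2)], k)
```

Adding a direct connection contributes a linear term, which is how the count doubles to 2k + 2 in the second equality of Theorem 1.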
The first inequality in Theorem 2 follows from the results in (Baum, 1988), who in fact established a lower bound of 2N⌊k/2⌋ for μ(k, H, N) (and hence for μ̄ too), for every N, not just N = 2 as in the theorem above. We conjecture, but have as yet been unable to prove, that direct connections or sigmoids should also improve these bounds by at least a factor of 2, just as in the two-dimensional case and in the worst-case analysis. Because of Lemma 4.2, the last statements in Theorems 1 and 3 are consequences of the previous two. \n\n5 SOME PARTICULAR ACTIVATION FUNCTIONS \n\nConsider the last inequality in Theorem 1. For arbitrary sigmoids, this is far too conservative, as the number μ̲ can be improved considerably from 2k, even made infinite (see below). We conjecture that for the important practical case θ(x) = σ(x) it is close to optimal, but the only upper bounds that we have are still too high. For the piecewise linear function π, at least, one has equality: \n\nLemma 5.1 μ̲(k, π) = 2k. \n\nIt is worth remarking that there are sigmoids θ, as differentiable as wanted, even real-analytic, where all classification measures are infinite. Of course, the function θ is so complicated that there is no reasonably \"finite\" implementation for it. This remark is only of theoretical interest, to indicate that, unless further restrictions are made on (S1)-(S2), much better bounds can be obtained. (If only μ and μ̄ are desired to be infinite, one may also take the simpler example θ(x) = sin(x). Note that for any l rationally independent real numbers xᵢ, the vectors of the form (sin(γx₁), ..., sin(γx_l)), with γ real, form a dense subset of [-1, 1]^l, so all dichotomies on {x₁, ..., x_l} can be implemented with (1, sin)-nets.) \n\nLemma 5.2 There is some sigmoid θ, which can be taken to be an analytic function, so that μ̲(1, θ) = ∞. 
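The density fact behind the sin(x) example is easy to observe numerically: sweeping the single frequency γ over a range already produces every sign pattern on a few rationally independent points. A small sketch (the point set, step size, and sweep range are arbitrary choices):

```python
import math

# Four rationally independent points on the line.
xs = [1.0, 2 ** 0.5, 3 ** 0.5, 5 ** 0.5]

def pattern(gamma):
    # Sign pattern induced on xs by the (1, sin)-net u -> sin(gamma * u).
    return tuple(math.sin(gamma * x) > 0 for x in xs)

# Sweep gamma; by rational independence the phases equidistribute on the
# torus, so the sweep eventually visits all 2^4 = 16 sign patterns.
patterns = {pattern(0.05 * i) for i in range(1, 40000)}
```

Each sign pattern corresponds to a dichotomy implemented by a (1, sin)-net with suitable output weights, which is the sense in which the single-frequency family already shatters arbitrarily large rationally independent sets.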
\n\n6 INTERPOLATION \n\nWe now consider the following approximate interpolation problem. Assume given a sequence of k (distinct) points x₁, ..., x_k in ℝ^N, any ε > 0, and any sequence of real numbers y₁, ..., y_k, as well as some class F of functions from ℝ^N to ℝ. We ask if there exists some \n\nf ∈ F so that |f(xᵢ) - yᵢ| < ε for each i.  (3) \n\nLet λ(F) be the largest integer k ≥ 1, possibly infinite, so that for every set of data as above (3) can be solved. Note that, obviously, λ(F) ≤ μ̲(F). Just as in Lemma 4.1, λ is independent of the dimension N when applied to nets. Thus we let λ^d(k, θ) and λ(k, θ) be respectively the values of λ(F) when applied to (k, θ)-nets with or without direct connections. \n\nWe now summarize properties of λ. The next result (see (Sontag, 1991), as well as the full version of this paper, for a proof) should be compared with Theorem 1. The main difference is in the second equality. Note that one can prove λ(k, θ) ≥ λ^d(k - 1, H), in complete analogy with the case of μ̲, but this is not sufficient anymore to be able to derive the last inequality in the Theorem from the second equality. \n\nTheorem 4 For any continuous sigmoid θ, and for each k, \n\nλ(k, H) = k + 1 \nλ^d(k, H) = k + 2 \nλ(k, θ) ≥ 2k - 1. \n\nRemark 6.1 Thus we can approximately interpolate any 2k - 1 points using k sigmoidal neurons. It is not hard to prove as a corollary that, for the standard sigmoid, this approximate interpolation property holds in the following stronger sense: for an open dense set of 2k - 1 points, one can achieve an open dense set of values; the proof involves looking first at points with rational coordinates, and using that on such points one is dealing basically with rational functions (after a diffeomorphism), plus some theory of semialgebraic sets. We conjecture that one should be able to interpolate at 2k points. 
Note that for k = 2 this is easy to achieve: just choose the slope d so that some zᵢ - zᵢ₊₁ becomes zero and the zᵢ are allowed to be nonincreasing or nondecreasing. The same proof, changing the signs if necessary, gives the wanted net. For some examples, it is quite easy to get 2k points. For instance, λ(k, π) = 2k for the piecewise linear sigmoid π. \n\n7 FURTHER REMARKS \n\nThe main conclusion from Theorem 1 is that sigmoids at least double recognition power for arbitrary sets. It may be the case that μ(k, σ, N)/μ(k, H, N) ≈ 2 for all N; this is true for N = 1 and is strongly suggested by Theorem 3 (the first bound appears to be quite tight). Unfortunately the proof of this theorem is based on a result from (Asano et al., 1990) regarding arrangements of points in the plane, a fact which does not generalize to dimension three or higher. \n\nOne may also compare the power of nets with and without connections, or threshold vs sigmoidal processors, on Boolean problems. For instance, it is a trivial consequence from the given results that parity on n bits can be computed with ⌈n/2⌉ hidden sigmoidal units and no direct connections, though requiring (apparently, though this is an open problem) n thresholds. In addition, for some families of Boolean functions, the gap between sigmoidal nets and threshold nets may be infinitely large (Sontag, 1990a). See (Sontag, 1990b) for representation properties of two-hidden-layer nets. \n\nAcknowledgements \n\nThis work was supported in part by Siemens Corporate Research, and in part by the CAIP Center, Rutgers University. \n\nReferences \n\nArai, M., \"Mapping abilities of three-layer neural networks,\" Proc. IJCNN Int. Joint Conf. on Neural Networks, Washington, June 18-22, 1989, IEEE Publications, 1989, pp. I-419-424. \n\nAsano, T., J. Hershberger, J. Pach, E.D. Sontag, D. Souvaine, and S. 
Suri, \"Sepa(cid:173)\nrating Bi-Chromatic Points by Parallel Lines,\" Proceedings of the Second Canadian \nConference on Computational Geometry, Ottawa, Canada, 1990, p. 46-49. \nBaum, E.B., \"On the capabilities of multilayer perceptrons,\" J.Complexity 4(1988): \n193-215. \nChester, D., \"Why two hidden layers and better than one,\" Proc. Int. Joint Conf. \non Neural Networks, Washington, DC, Jan. 1990, IEEE Publications, 1990, p. 1.265-\n268. \nCybenko, G., \"Approximation by superpositions of a sigmoidal function,\" Math. \nControl, Signals, and Systems 2(1989): 303-314. \nFunahashi, K., \"On the approximate realization of continuous mappings by neural \nnetworks,\" Proc. Int. Joint Conf. on Neural Networks, IEEE Publications, 1988, p. \n1.641-648. \nHornik, K.M., M. Stinchcombe, and H. White, \"Multilayer feedforward networks \nare universal approximators,\" Neural Networks 2(1989): 359-366. \nSontag, E.D., \"Sigmoids distinguish better than Heavisides,\" Neural Computation \n1(1989): 470-472. \nSontag, E.D., \"On the recognition capabilities of feedforward nets,\" Report \nSYCON-90-03, Rutgers Center for Systems and Control, April 1990. \nSontag, E.D., \"Feedback Stabilization Using Two-Hidden-Layer Nets,\" Report \nSYCON-90-11, Rutgers Center for Systems and Control, October 1990. \nSontag, E.D., \"Capabilities and training of feedforward nets,\" in Theory and Ap(cid:173)\nplications of Neural Networks (R. Mammone and J. Zeevi, eds.), Academic Press, \nNY, 1991, to appear. \nSontag, E.D., and H.J. Sussmann, \"Back propagation can give rise to spurious local \nminima even for networks without hidden layers,\" Complex Systems 3(1989): 91-\n106. \nSontag, E.D., and H.J. Sussmann, \"Backpropagation separates where perceptrons \ndo,\" Neural Networks(1991), to appear. \n\n\f", "award": [], "sourceid": 436, "authors": [{"given_name": "Eduardo", "family_name": "Sontag", "institution": null}]}