{"title": "Observability of Neural Network Behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 455, "page_last": 462, "abstract": null, "full_text": "Observability of Neural Network \n\nBehavior \n\nMax Garzon \n\nsarzonmOherme \u2022. msci.mem.t.edu \n\nInstitute for Intelligent Systems \n\nFernanda Botelho \n\nbotelhoflherme \u2022. msci.mem.t.edu \nDepartment of Mathematical Sciences \n\nMemphis State University \nMemphis, TN 38152 U.S.A. \n\nAbstract \n\nWe prove that except possibly for small exceptional sets, discrete(cid:173)\ntime analog neural nets are globally observable, i.e. all their cor(cid:173)\nrupted pseudo-orbits on computer simulations actually reflect the \ntrue dynamical behavior of the network. Locally finite discrete \n(boolean) neural networks are observable without exception. \n\n1 \n\nINTRODUCTION \n\nWe address some aspects of the general problem of implementation and robustness \nof (mainly recurrent) autonomous discrete-time neural networks with continuous \nactivation (herein referred to as analog networks) and discrete activation (herein, \nboolean networks). There are three main sources of perturbations from ideal oper(cid:173)\nation in a neural network. First, the network's parameters may have been contam(cid:173)\ninated with noise from external sources. Second, the network is being implemented \nin optics or electronics (digital or analog) and inherent measurement limitations \npreclude the use of perfect information on the network's parameters. Third, as has \nbeen the most common practice so far, neural networks are simulated or imple(cid:173)\nmented on a digital device, with the consequent limitations on precision to which \nnet parameters can be represented. Finally, for these or other reasons, the activation \nfunctions (e.g. sigmoids) of the network are not known precisely or cannot be evalu(cid:173)\nated properly. 
Although perhaps negligible in a single iteration, these perturbations are likely to accumulate under iteration, even in feedforward nets. Eventually, they may, in fact, distort the results of the implementation to the point of making the simulation useless, if not misleading.\n\n455\n\nThere is, therefore, an important difference between the intended operation of an idealized neural network and its observable behavior. This is a classical problem in systems theory, and it has been addressed in several ways. First, there are the classical notions of distinguishability and observability in control theory (Sontag, 1990), which roughly state that every pair of a system's states is distinguishable by different outputs over evolution in finite time. This is thus a notion of local state observability. More recently, several results have established more global notions of identifiability of discrete-time feedforward (Sussmann, 1992; Chen, Lu, Hecht-Nielsen, 1993) and continuous-time recurrent neural nets (Albertini and Sontag, 1993a,b), which roughly state that for given odd activation functions (such as tanh), the weights of a neural network are essentially uniquely determined (up to permutations and cell redundancies) by the input/output behavior of the network. These notions do assume error-free inputs, weights, and activation functions.\n\nIn general, a computer simulation of an orbit of a given dynamical system in the continuum (known as a pseudo-orbit) is, in fact, far from the orbit in the ideal system. Motivated by this problem, Birkhoff introduced the so-called shadowing property. A system satisfies the shadowing property if all pseudo-orbits are uniformly approximated by actual orbits, so that the former capture the long-term behavior of the system. Bowen showed that sufficiently hyperbolic systems in real euclidean spaces do have the shadowing property (Bowen, 1978). 
However, it appears difficult even to give a characterization of exactly which maps on the interval possess the property; see e.g. (Coven, Kan, Yorke, 1988). Precise definitions of all terms can be found in section 2.\n\nBy comparison with state observability and identifiability, the shadowing property is a type of global observability of a system through its dynamical behavior. Since autonomous recurrent networks can be seen as dynamical systems, it is natural to investigate this property. Thus, a neural net is observable in the sense that its behavior (i.e. the sequence of its ideal actions on given initial conditions) can be observed on computer simulations or discrete implementations, despite inevitable concomitant approximations and errors.\n\nThe purpose of this paper is to explore this property as a deterministic model for perturbations of neural network behavior in the presence of arbitrarily small errors from various sources. The model includes both discrete and analog networks. In section 4 we sketch a proof that locally finite boolean neural networks (even with an infinite number of neurons) are all observable in this sense. This is not true in general for analog networks, and section 3 is devoted to sketching necessary and sufficient conditions for the relatively few analog exceptions for the most common transfer functions: hard thresholds, a variety of sigmoids (hyperbolic tangent, logistic, etc.), and saturated linear maps. Finally, section 5 discusses the results and poses some other problems worthy of further research.\n\n2 DEFINITIONS AND MAIN RESULTS\n\nThis section contains notation and precise definitions in a general setting, so as to include discrete-time networks both with discrete and continuous activations. Let f : X -> X be a continuous map of a compact metric space with metric |*, *|. 
The orbit of x ∈ X is the sequence {x, f(x), ..., f^k(x), ...}, i.e. a sequence of points {x^k}_{k≥0} for which x^{k+1} = f(x^k) for all k ≥ 0. Given a number δ > 0, a δ-pseudo-orbit is a sequence {x^k} such that the distances |f(x^k), x^{k+1}| < δ for all k ≥ 0. Pseudo-orbits arise as trajectories of ideal dynamical processes contaminated by errors and noise. In such cases, especially when errors propagate exponentially, it is important to know when the numerical process is actually representing some meaningful trajectory of the real process.\n\nDefinition 2.1 The map f on a metric space X is (globally) observable (equivalently, has the shadowing property, or is traceable) if and only if for every ε > 0 there exists a δ > 0 so that any δ-pseudo-orbit {x^k} is ε-approximated by the orbit, under f, of some point z ∈ X, i.e. |x^k, f^k(z)| < ε for all k ≥ 0.\n\nOne might observe that computer simulations only run for finite time. On compact spaces (as is the case below), observability can be shown to be equivalent to a similar property of shadowing finite pseudo-orbits.\n\n'Analog neural network' here means a finite number n of units (or cells), each of which is characterized by an activation (sometimes called output) function σ_i : R -> R, and a weight matrix W of synaptic strengths between the various units. Units can assume real-valued activations x_i, which are updated synchronously and simultaneously at discrete instants of time, according to the equation\n\nx_i(t + 1) = σ_i[ Σ_j w_ij x_j(t) ].   (1)\n\nThe total activation of the network at any time is hence given by a vector x in euclidean space R^n, and the entire network is characterized by a global dynamics\n\nT(x) = σ[W x],   (2)\n\nwhere W x denotes the ordinary matrix-vector product and σ is the map acting as σ_i along the ith component. This component in a vector x is denoted x_i (as opposed to x^k, the kth term of a sequence). 
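As a minimal runnable sketch of these definitions (the two-cell weight matrix and the tanh activation are our own illustrative choices, not taken from the paper), one can iterate the global dynamics T(x) = σ[Wx] of (1)-(2) both exactly and with per-step noise, producing a δ-pseudo-orbit in the sense just defined:

```python
# Sketch (illustrative values, not from the paper): iterate T(x) = tanh(W x)
# exactly and with bounded per-step noise to produce a delta-pseudo-orbit.
import numpy as np

def make_net(W, sigma):
    """Return the global dynamics T(x) = sigma(W x) of equation (2)."""
    return lambda x: sigma(W @ x)

rng = np.random.default_rng(0)
W = np.array([[0.3, -0.2],
              [0.1,  0.4]])
T = make_net(W, np.tanh)

delta = 1e-6
x_orbit = [np.array([0.5, -0.5])]   # exact orbit
x_pseudo = [np.array([0.5, -0.5])]  # noisy pseudo-orbit
for _ in range(50):
    x_orbit.append(T(x_orbit[-1]))
    noise = rng.uniform(-delta / 2, delta / 2, size=2)
    x_pseudo.append(T(x_pseudo[-1]) + noise)

# every step satisfies |T(x^k), x^{k+1}| < delta, so this is a delta-pseudo-orbit
gaps = [np.linalg.norm(T(a) - b) for a, b in zip(x_pseudo, x_pseudo[1:])]
assert max(gaps) < delta
```

Because this particular W makes T a contraction, the pseudo-orbit stays uniformly close to the true orbit of its first element, which is exactly the ε-tracing that Definition 2.1 asks for.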
The unit hypercube in R^n is denoted I^n. An analog network is then defined as a dynamical system in a finite-dimensional euclidean space, and one may then call a neural network (globally) observable if its global dynamics is an observable dynamical system. Likewise for boolean networks, which will be defined precisely in section 4.\n\nWe end this section with some background facts about observability on the continuum. It is a perhaps surprising but trivial remark that the identity map of the real interval is not observable in this sense: orbits remain fixed, but pseudo-orbits may drift away from the original state and can, in fact, be dense in the interval. Likewise, common activation functions of neural networks (such as hard thresholds and logistic maps) are not observable. For linear maps, observability has long been known to be equivalent to hyperbolicity (all eigenvalues λ have |λ| != 1). A composition of observable maps is usually not observable (take, for instance, a hyperbolic homeomorphism and its inverse). In contrast, compositions of linear maps and nonobservable activation functions in neural networks are, nevertheless, observable. The main take-home message can be loosely summarized as follows.\n\nTheorem 2.1 Except for a negligible fraction of exceptions, discrete-time analog neural nets are observable. All discrete (boolean) neural networks are observable.\n\n3 ANALOG NEURAL NETWORKS\n\nThis section contains (a sketch of) necessary and sufficient conditions for analog networks to be observable for common types of activation functions.\n\n3.1 HARD-THRESHOLD ACTIVATION FUNCTIONS\n\nIt is not hard to give necessary and sufficient conditions for observability of nets with discrete activation functions of the type\n\nσ_i(u) := 1 if u >= θ_i, and 0 otherwise,\n\nwhere θ_i is a threshold characterizing cell i. 
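As a concrete sketch (the weights and thresholds below are our own toy choices, not the paper's), such a hard-threshold net has finite range, and the observability criterion of Theorem 3.1 below can be checked by direct enumeration:

```python
# Toy hard-threshold net (illustrative weights/thresholds): the update is
# x_i(t+1) = 1 if (W x)_i >= theta_i, else 0. Its range is finite, so the
# criterion of Theorem 3.1 -- (W y)_i != theta_i for every y in the range
# of T -- can be verified by enumerating all binary total states.
import itertools
import numpy as np

W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
theta = np.array([0.5, 0.5])

def T(x):
    return (W @ x >= theta).astype(float)

# enumerate the (finite) range of T over all binary total states
range_T = {tuple(T(np.array(x))) for x in itertools.product([0.0, 1.0], repeat=2)}

observable = all(
    not np.isclose((W @ np.array(y))[i], theta[i])
    for y in range_T
    for i in range(len(theta))
)
assert observable  # no range point hits a threshold exactly, so T is observable
```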
\n\nLemma 3.1 A map f : R^n -> R^n with finite range is observable if and only if it is continuous at each point of its range.\n\nPROOF. The condition is clearly sufficient: if f is continuous at every point of its range, small enough perturbations x^{k+1} of an image f(x^k) have the same image f(x^{k+1}) = f(f(x^k)), and hence, for δ small enough, every δ-pseudo-orbit is traced by the first element of the pseudo-orbit. Conversely, assume f is not continuous at a point f(x^0) of its range. Let x^1, x^2, ... be a sequence convergent to f(x^0) whose image does not converge to f(f(x^0)) (such a sequence can always be so chosen because the range is discrete). Let\n\nε := (1/2) min { |f(x), f(y)| : x, y ∈ R^n, f(x) != f(y) }.\n\nFor a given δ > 0 the pseudo-orbit x^0, x^k, f(x^k), f^2(x^k), ... is not traceable for k large enough. Indeed, for any z within ε-distance of x^0, either f(z) != f(x^0), in which case this distance is at least ε, or else they coincide, in which case |f^2(z), f(x^k)| > ε anyway by the choice of x^k. □\n\nNow we can apply Lemma 3.1 to obtain the following characterization.\n\nTheorem 3.1 A discrete-time neural net T with weight matrix W := (w_ij) and threshold vector θ is observable if and only if, for every y in the range of T, (W y)_i != θ_i for every i (1 <= i <= n).\n\n3.2 SIGMOIDAL ACTIVATION FUNCTIONS\n\nIn this section, we establish the observability of arbitrary neural nets with a fairly general type of sigmoidal activation function, as defined next.\n\nDefinition 3.1 A map σ : R -> R is sigmoidal if it is strictly increasing, bounded (above and below), and continuously differentiable.\n\nImportant examples are the logistic map\n\nσ(u) = 1 / (1 + exp(-μu)),\n\nand the arctan and hyperbolic tangent maps\n\nσ(u) = arctan(μu),   σ(u) = tanh(u) = (exp(u) - exp(-u)) / (exp(u) + exp(-u)). 
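These example sigmoids can be written down directly; the short numerical check below (our own sketch, not from the paper) confirms the two properties Definition 3.1 demands of them, strict monotonicity and boundedness:

```python
# Numerical sanity check of Definition 3.1 for the three example sigmoids:
# each must be strictly increasing and bounded above and below.
import numpy as np

mu = 1.0  # gain factor
sigmoids = {
    "logistic": lambda u: 1.0 / (1.0 + np.exp(-mu * u)),
    "arctan":   lambda u: np.arctan(mu * u),
    "tanh":     lambda u: np.tanh(u),
}

u = np.linspace(-10.0, 10.0, 2001)
for name, s in sigmoids.items():
    v = s(u)
    assert np.all(np.diff(v) > 0), name          # strictly increasing on the grid
    assert -np.pi < v.min() and v.max() < np.pi  # bounded above and below
```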
\n\nNote that, in particular, the range of a sigmoidal map is an open and bounded interval which, without loss of generality, can be assumed to be the unit interval I. Indeed, if a neural net has weight matrix W and activation function σ which is conjugate to an activation function σ' by a conjugacy ξ, then\n\nσ ∘ W ~ σ' ∘ (ξWξ^{-1}),\n\nwhere ~ denotes conjugacy. One can, moreover, assume that the gain factors in the sigmoid functions are all μ = 1 (multiply the rows of W).\n\nTheorem 3.2 Every neural network with a sigmoidal activation function has a strong attractor and, in particular, is observable.\n\nPROOF. Let a neural net with n cells have weight matrix W and sigmoidal σ. Consider a parametrized family {T_μ}_μ of nets with sigmoidals given by σ_μ := μσ. It is easy to see that each T_μ (μ > 0) is conjugate to T; however, W needs to be replaced by a suitable conjugate under a homeomorphism ξ_μ. By Brouwer's fixed point theorem, T_μ has a fixed point p* in I^n. The key idea in the proof is the fact that the dynamics of the network admits a Lyapunov function given by the distance from p*. Indeed,\n\n|| T_μ(x) - T_μ(p*) || <= sup_y |J T_μ(y)| || x - p* ||,\n\nwhere J denotes the jacobian. Using the chain rule and the fact that the derivatives of ξ_μ and σ_μ are bounded (say, below by b and above by B), the jacobian satisfies\n\n|J T_μ(y)| <= μ^n (bB)^n |W|,\n\nwhere |W| denotes the determinant of W. Therefore we can choose μ small enough that the right-hand side of this expression is less than 1 for arbitrary y, so that T_μ is a contraction. Thus, the orbit of the first element in any ε-pseudo-orbit ε-traces the orbit. □\n\n3.3 SATURATED-LINEAR ACTIVATION FUNCTIONS\n\nThe case of the nondifferentiable saturated-linear sigmoid given by the piecewise linear map\n\nσ(u) = 0 for u < 0;  u for 0 <= u <= 1;  1 for u > 1   (3)\n\npresents some difficulties. 
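The difficulty can be previewed concretely (our own toy example, not the paper's): with identity weights, the saturated-linear net fixes every point of the unit cube, so pseudo-orbits can drift arbitrarily far while every true orbit stands still.

```python
# Toy preview: with W = I, the saturated-linear net T(x) = sigma(W x) of (3)
# fixes every point of the unit cube, so a delta-pseudo-orbit can drift far
# beyond delta while true orbits stay put -- such a T cannot be observable.
import numpy as np

def sat(u):
    """Saturated-linear activation of equation (3), applied elementwise."""
    return np.clip(u, 0.0, 1.0)

W = np.eye(2)

def T(x):
    return sat(W @ x)

delta = 0.01
x = np.array([0.0, 0.5])
start = x.copy()
for _ in range(80):
    x = T(x) + np.array([delta / 2, 0.0])  # each step lands within delta of T(x)

drift = np.linalg.norm(x - start)
assert drift > 10 * delta  # the pseudo-orbit drifted far; no single orbit traces it
```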
First, we establish a general necessary condition for observability, which follows easily for linear maps, since shadowing is then equivalent to hyperbolicity.\n\nTheorem 3.3 If T leaves a segment of positive length pointwise fixed, then T is not observable.\n\nAlthough easy to see in the case of one-dimensional systems, due to the fact that the identity map is not observable, a proof in higher dimensions requires showing that a dense pseudo-orbit in the fixed segment is not traceable by points outside the segment. The proof makes use of an auxiliary result.\n\nLemma 3.2 A linear map L : R^n -> R^n acts along the orbit of a point x in the unit hypercube either as an attractor to 0, a repellor to infinity, or else as a rigid rotation or reflection.\n\nPROOF. By passing to the complexification L' : C^n -> C^n of L and then to a conjugate, assume without loss of generality that L has a matrix in Jordan canonical form, with blocks either diagonal or diagonal with a first upper minor diagonal of 1s. It suffices to show the claim for each block, since the map is a cartesian product of its restrictions to the subspaces corresponding to the blocks. First, consider the diagonal case. If the eigenvalue satisfies |λ| < 1 (|λ| > 1, respectively), clearly the orbit L^k(x) -> 0 (|| L^k(x) || -> infinity). If |λ| = 1, L acts as a rotation. In the nondiagonal case, it is easy to see that the components of the iterates of x = (x_1, ..., x_m) are given by\n\n(L^t(x))_1 = Σ_{k=0}^{t} C(t,k) λ^{t-k} x_{k+1},  (L^t(x))_2 = Σ_{k=0}^{t-1} C(t,k) λ^{t-k} x_{k+2},  ...,  (L^t(x))_m = λ^t x_m,   (4)\n\nwhere C(t,k) denotes the binomial coefficient. The previous argument for the diagonal block still applies for |λ| != 1. If |λ| = 1 and at least two components of x ∈ I^n are nonzero, then they are positive and again || L^t(x) || -> infinity. In the remaining case, L acts as a rotation, since it reduces to multiplication of a single coordinate of x by λ. □\n\nPROOF OF THEOREM 3.3. 
Assume that T = σ ∘ L and that T leaves invariant a segment xy of positive length. Suppose first that L leaves the same segment invariant as well. By Lemma 3.2, a pseudo-orbit in the interior of the hypercube I^n cannot then be traced by the orbit of a point in the hypercube. If L moves the segment xy invariant under T, we can assume without loss of generality that it lies entirely on a hyperplane face F of I^n and that the action of σ on L(xy) is just a projection onto F. But in that case, the action of T on the segment is a (composition of two) linear map(s), and the same argument applies. □\n\nWe point out that, in particular, T may not be observable even if W is hyperbolic.\n\nThe condition in Theorem 3.3 is, in fact, sufficient. The proof is more involved and is given in detail in (Garzon & Botelho, 1994). With Theorem 3.3 one can then determine relatively simple necessary and sufficient conditions for observability (in terms of the eigenvalues and determinants of a finite number of linear maps). They establish Theorem 2.1 for saturated-linear activation functions.\n\n4 BOOLEAN NETWORKS\n\nThis section contains precise definitions of discrete (boolean) neural networks and a sketch of the proof that they are observable in general.\n\nDiscrete neural networks have a finite number of activations, and their state sets are endowed with an addition and a multiplication. The activation function σ_i (typically a threshold function) can be given by an arbitrary boolean table, which may vary from cell to cell. They can, moreover, have an infinite number of cells (the only case of interest here, since finite boolean networks are trivially observable). However, since the activation set is finite, it only makes sense to consider locally finite networks, in which every cell i only receives input from finitely many others.\n\nA total state is now usually called a configuration. 
A configuration is best thought of as an infinite sequence x := x_1 x_2 x_3 ... consisting of the activations of all cells listed in some fixed order. The space of all configurations is a compact metric space if endowed with any of a number of equivalent metrics, such as |x, y| := 2^{-m}, where m = inf{i : x_i != y_i}. In this metric, a small perturbation of a configuration is obtained by changing the values of x at cells far away from x_1.\n\nThe simplest question about observability in a general space concerns the shadowing of the identity function. Observability of the identity happens to be a property characteristic of configuration spaces. Recall that a totally disconnected topological space is one in which the connected component of every element is the element itself.\n\nTheorem 4.1 The identity map id of a compact metric space X is observable iff X is totally disconnected.\n\nThe first step in the proof of Theorem 4.3 below is to characterize observability of linear boolean networks (i.e. those obeying the superposition principle).\n\nTheorem 4.2 Every linear continuous map has the shadowing property.\n\nFor the other step we use a global decomposition T = F ∘ L of the global dynamics of a discrete network as a continuous transformation of configuration space, due to (Garzon & Franklin, 1990). The reader is referred to (Garzon & Botelho, 1992) for a detailed proof of all the results in this section.\n\nTheorem 4.3 Every discrete (boolean) neural network is observable.\n\n5 CONCLUSION AND OPEN PROBLEMS\n\nIt has been shown that the particular combination of a linear map with an activation function is usually globally observable, despite the fact that neither of them is observable and the fact that, ordinarily, composition destroys observability. Intuitively, this means that observing the input/output behavior of a neural network will eventually give away the true nature of the network's behavior, even if the network perturbs its behavior slightly at each step of its evolution. In simple terms, such a network cannot fool all the people all of the time.\n\nThe results are valid for virtually every type of autonomous first-order network that evolves in discrete time, whether the activations are boolean or continuous. Several consequences follow from this characterization. For example, in all likelihood there exist observable universal neural nets, despite the consequent undecidability of their computational behavior. Also, neural nets are thus a very natural set of primitives for the approximation and implementation of more general dynamical systems. These and other consequences will be explored elsewhere (Botelho & Garzon, 1994).\n\nNatural questions arise from these results. First, whether observability is a general property of most analog networks evolving in continuous time as well. Second, what other types of combinations of nonobservable systems of more general kinds create observability, i.e. to what extent neural networks are peculiar in this regard. For example, are higher-order neural networks observable? Those with sigma-pi units? Finally, there is the broader question of the robustness of neural network implementations, which bring about inevitable errors in inputs and/or weights. The results in this paper give a deeper explanation for the touted robustness and fault-tolerance of neural network solutions. But, further, they also seem to indicate that it may be possible to require that neural net solutions have observable behavior as well, without a tradeoff in the quality of the solution. An exact formulation of this question is worthy of further research. 
\n\nAcknowledgements\n\nThe work of the first author was partially done while on support from NSF grant CCR-9010985 and CNRS-France.\n\nReferences\n\nF. Albertini and E.D. Sontag. (1993a) Identifiability of discrete-time neural networks. In Proc. European Control Conference, 460-465. Groningen, The Netherlands: Springer-Verlag.\n\nF. Albertini and E.D. Sontag. (1993b) For neural networks, function determines form. Neural Networks 6(7): 975-990.\n\nF. Botelho and M. Garzon. (1992) Boolean Neural Nets are Observable. Memphis State University: Technical Report 92-18.\n\nF. Botelho and M. Garzon. (1994) Generalized Shadowing Properties. J. Random and Computational Dynamics, in print.\n\nR. Bowen. (1978) On Axiom A diffeomorphisms. In CBMS Regional Conference Series in Math. 35. Providence, Rhode Island: American Math. Society.\n\nA.M. Chen, H. Lu, and R. Hecht-Nielsen. (1993) On the Geometry of Feedforward Neural Network Error Surfaces. Neural Computation 5(6): 910-927.\n\nE. Coven, I. Kan, and J. Yorke. (1988) Pseudo-orbit shadowing in the family of tent maps. Trans. AMS 308: 227-241.\n\nM. Garzon and S.P. Franklin. (1990) Global dynamics in neural networks II. Complex Systems 4(5): 509-518.\n\nM. Garzon and F. Botelho. (1994) Observability of Discrete-time Analog Networks, preprint.\n\nE.D. Sontag. (1990) Mathematical Control Theory: Deterministic Finite-Dimensional Systems. New York: Springer-Verlag.\n\nH. Sussmann. (1992) Uniqueness of the Weights for Minimal Feedforward Nets with a Given Input-Output Map. Neural Networks 5(4): 589-593.\n", "award": [], "sourceid": 806, "authors": [{"given_name": "Max", "family_name": "Garzon", "institution": null}, {"given_name": "Fernanda", "family_name": "Botelho", "institution": null}]}