{"title": "Adaptive knot Placement for Nonparametric Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 247, "page_last": 254, "abstract": null, "full_text": "Grammatical Inference by \n\nAttentional Control of Synchronization \n\nin an Oscillating Elman Network \n\nBill Baird \n\nDept Mathematics, \n\nU.C.Berkeley, \n\nBerkeley, Ca. 94720, \n\nbaird@math.berkeley.edu \n\nTodd Troyer \nDept of Phys., \n\nU.C.San Francisco, \n513 Parnassus Ave. \n\nSan Francisco, Ca. 94143, \n\ntodd@phy.ucsf.edu \n\nFrank Eeckman \nLawrence Livermore \nNational Laboratory, \nP.O. Box 808 (L-270), \nLivermore, Ca. 94550, \n\neeckman@.llnl.gov \n\nAbstract \n\nWe show how an \"Elman\" network architecture, constructed from \nrecurrently connected oscillatory associative memory network mod(cid:173)\nules, can employ selective \"attentional\" control of synchronization \nto direct the flow of communication and computation within the \narchitecture to solve a grammatical inference problem. \nPreviously we have shown how the discrete time \"Elman\" network \nalgorithm can be implemented in a network completely described \nby continuous ordinary differential equations. The time steps (ma(cid:173)\nchine cycles) of the system are implemented by rhythmic variation \n(clocking) of a bifurcation parameter. In this architecture, oscilla(cid:173)\ntion amplitude codes the information content or activity of a mod(cid:173)\nule (unit), whereas phase and frequency are used to \"softwire\" the \nnetwork. Only synchronized modules communicate by exchang(cid:173)\ning amplitude information; the activity of non-resonating modules \ncontributes incoherent crosstalk noise. \nAttentional control is modeled as a special subset of the hidden \nmodules with ouputs which affect the resonant frequencies of other \nhidden modules. 
They control synchrony among the other modules and direct the flow of computation (attention) to effect transitions between two subgraphs of a thirteen state automaton which the system emulates to generate a Reber grammar. The internal crosstalk noise is used to drive the required random transitions of the automaton. \n\n67 \n\n\f68 \n\nBaird, Troyer, and Eeckman \n\n1 \n\nIntroduction \n\nRecordings of local field potentials have revealed 40 to 80 Hz oscillation in vertebrate cortex [Freeman and Baird, 1987, Gray and Singer, 1987]. The amplitude patterns of such oscillations have been shown to predict the olfactory and visual pattern recognition responses of a trained animal. There is further evidence that although the oscillatory activity appears to be roughly periodic, it is actually chaotic when examined in detail. This preliminary evidence suggests that oscillatory or chaotic network modules may form the cortical substrate for many of the sensory, motor, and cognitive functions now studied in static networks. \nIt remains to be shown how networks with more complex dynamics can perform these operations and what possible advantages are to be gained by such complexity. We have therefore constructed a parallel distributed processing architecture that is inspired by the structure and dynamics of cerebral cortex, and applied it to the problem of grammatical inference. The construction views cortex as a set of coupled oscillatory associative memories, and is guided by the principle that attractors must be used by macroscopic systems for reliable computation in the presence of noise. This system must function reliably in the midst of noise generated by crosstalk from its own activity. 
Present day digital computers are built of flip-flops which, at the level of their transistors, are continuous dissipative dynamical systems with different attractors underlying the symbols we call \"0\" and \"1\". In a similar manner, the network we have constructed is a symbol processing system, but with analog input and oscillatory subsymbolic representations. \n\nThe architecture operates as a thirteen state finite automaton that generates the symbol strings of a Reber grammar. It is designed to demonstrate and study the following issues and principles of neural computation: (1) Sequential computation with coupled associative memories. (2) Computation with attractors for reliable operation in the presence of noise. (3) Discrete time and state symbol processing arising from continuum dynamics by bifurcations of attractors. (4) Attention as selective synchronization controlling communication and temporal program flow. (5) Chaotic dynamics in some network modules driving random choice of attractors in other network modules. The first three issues have been fully addressed in a previous paper [Baird et al., 1993], and are only briefly reviewed. We focus here on the last two. \n\n1.1 Attentional Processing \n\nAn important element of intra-cortical communication in the brain, and between modules in this architecture, is the ability of a module to detect and respond to the proper input signal from a particular module, when inputs from other modules irrelevant to the present computation are contributing crosstalk noise. This is similar to the problem of coding messages in a computer architecture like the Connection Machine so that they can be picked up from the common communication bus line by the proper receiving module. \nPeriodic or nearly periodic (chaotic) variation of a signal introduces additional degrees of freedom that can be exploited in a computational architecture. 
We investigate the principle that selective control of synchronization, which we hypothesize to be a model of \"attention\", can be used to solve this coding problem and control communication and program flow in an architecture with dynamic attractors. \n\nThe architecture illustrates the notion that synchronization not only \"binds\" sensory inputs into \"objects\" [Gray and Singer, 1987], but binds the activity of selected cortical areas into a functional whole that directs behavior. It is a model of \"attended activity\" as that subset which has been included in the processing of the moment by synchronization. This is both a spatial and temporal binding. Only the inputs which are synchronized to the internal oscillatory activity of a module can effect previously learned transitions of attractors within it. For example, consider two objects in the visual field separately bound in primary visual cortex by synchronization of their components at different phases or frequencies. One object may be selectively attended to by its entrainment to oscillatory processing at higher levels such as V4 or IT. These in turn are in synchrony with oscillatory activity in motor areas to select the attractors there which are directing motor output. \nIn the architecture presented here, we have constrained the network dynamics so that there exist well defined notions of amplitude, phase, and frequency. The network has been designed so that amplitude codes the information content or activity of a module, whereas phase and frequency are used to \"softwire\" the network. An oscillatory network module has a passband outside of which it will not synchronize with an oscillatory input. Modules can therefore easily be desynchronized by perturbing their resonant frequencies. 
Furthermore, only synchronized modules communicate by exchanging amplitude information; the activity of non-resonating modules contributes incoherent crosstalk or noise. The flow of communication between modules can thus be controlled by controlling synchrony. By changing the intrinsic frequency of modules in a patterned way, the effective connectivity of the network is changed. The same hardware and connection matrix can thus subserve many different computations and patterns of interaction between modules without crosstalk problems. \nThe crosstalk noise is actually essential to the function of the system. It serves as the noise source for making random choices of output symbols and automaton state transitions in this architecture, as we discuss later. In cortex there is an issue as to what may constitute a source of randomness of sufficient magnitude to perturb the large ensemble behavior of neural activity at the cortical network level. It does not seem likely that the well known molecular fluctuations which are easily averaged within one or a few neurons can do the job. The architecture here models the hypothesis that deterministic chaos in the macroscopic dynamics of a network of neurons, which is of the same order of magnitude as the coherent activity, can serve this purpose. \nIn a set of modules which is desynchronized by perturbing the resonant frequencies of the group, coherence is lost and \"random\" phase relations result. The character of the model time traces is irregular, as seen in real neural ensemble activity. The behavior of the time traces in different modules of the architecture is similar to the temporary appearance and switching of synchronization between cortical areas seen in observations of cortical processing during sensory/motor tasks in monkeys and humans [Bressler and Nakamura, 1993]. 
The structure of this apparently chaotic signal and its use in network learning and operation are currently under investigation. \n\n2 Normal Form Associative Memory Modules \n\nThe mathematical foundation for the construction of network modules is contained in the normal form projection algorithm [Baird and Eeckman, 1993]. This is a learning algorithm for recurrent analog neural networks which allows associative memory storage of analog patterns, continuous periodic sequences, and chaotic attractors in the same network. An N node module can be shown to function as an associative memory for up to N/2 oscillatory, or N/3 chaotic, memory attractors [Baird and Eeckman, 1993]. A key feature of a net constructed by this algorithm is that the underlying dynamics is explicitly isomorphic to any of a class of standard, well understood nonlinear dynamical systems - a normal form [Guckenheimer and Holmes, 1983]. \nThe network modules of this architecture were developed previously as models of olfactory cortex with distributed patterns of activity like those observed experimentally [Baird, 1990, Freeman and Baird, 1987]. Such a biological network is dynamically equivalent to a network in normal form and may easily be designed, simulated, and theoretically evaluated in these coordinates. When the intramodule competition is high, they are \"memory\" or winner-take-all coordinates where attractors have one oscillator at maximum amplitude, with the other amplitudes near zero. In figure two, the input and output modules are demonstrating a distributed amplitude pattern (the symbol \"T\"), and the hidden and context modules are two-attractor modules in normal form coordinates showing either a right or left side active. \nIn this paper all networks are discussed in normal form coordinates. 
By analyzing the network in these coordinates, the amplitude and phase dynamics have a particularly simple interaction. When the input to a module is synchronized with its intrinsic oscillation, the amplitude of the periodic activity may be considered separately from the phase rotation. The module may then be viewed as a static network with these amplitudes as its activity. \nTo illustrate the behavior of individual network modules, we examine a binary (two-attractor) module; the behavior of modules with more than two attractors is similar. Such a unit is defined in polar normal form coordinates by the following equations of the Hopf normal form: \n\ndr_{1i}/dt = u_i r_{1i} − c r_{1i}^3 + (d − b sin(ω_clock t)) r_{1i} r_{0i}^2 + Σ_j w^+_{ij} I_j cos(θ_j − θ_{1i}) \ndr_{0i}/dt = u_i r_{0i} − c r_{0i}^3 + (d − b sin(ω_clock t)) r_{0i} r_{1i}^2 + Σ_j w^-_{ij} I_j cos(θ_j − θ_{0i}) \ndθ_{1i}/dt = ω_i + Σ_j w^+_{ij} (I_j / r_{1i}) sin(θ_j − θ_{1i}) \ndθ_{0i}/dt = ω_i + Σ_j w^-_{ij} (I_j / r_{0i}) sin(θ_j − θ_{0i}) \n\nThe clocked parameter b sin(ω_clock t) is used to implement the discrete time machine cycle of the Elman architecture as discussed later. It has lower frequency (1/10) than the intrinsic frequency of the unit, ω_i. \nExamination of the phase equations shows that a unit has a strong tendency to synchronize with an input of similar frequency. Define the phase difference φ = θ_0 − θ_J = θ_0 − ω_J t between a unit θ_0 and its input θ_J. For either side of a unit driven by an input of the same frequency, ω_J = ω_0, there is an attractor at zero phase difference φ = 0 and a repellor at φ = 180 degrees. In simulations, the interconnected network of these units described below synchronizes robustly within a few cycles following a perturbation. If the frequencies of some modules of the architecture are randomly dispersed by a significant amount, ω_J − ω_0 ≠ 0, phase-lags appear first, then synchronization is lost in those units. 
An oscillating module therefore acts as a band pass filter for oscillatory inputs. \n\nWhen the oscillators are synchronized with the input, θ_j − θ_{1i} = 0, the phase terms cos(θ_j − θ_{1i}) = cos(0) = 1 disappear. This leaves the amplitude equations r_{1i} and r_{0i} with static inputs Σ_j w^+_{ij} I_j and Σ_j w^-_{ij} I_j. Thus we have network modules which emulate static network units in their amplitude activity when fully phase-locked to their input. Amplitude information is transmitted between modules, with an oscillatory carrier. \nFor fixed values of the competition, in a completely synchronized system, the internal amplitude dynamics define a gradient dynamical system for a fourth order energy function. External inputs that are phase-locked to the module's intrinsic oscillation simply add a linear tilt to the landscape. \nFor low levels of competition, there is a broad circular valley. When tilted by external input, there is a unique equilibrium that is determined by the bias in tilt along one axis over the other. Thinking of r_{1i} as the \"activity\" of the unit, this activity becomes a monotonically increasing function of input. The module behaves as an analog connectionist unit whose transfer function can be approximated by a sigmoid. We refer to this as the \"analog\" mode of operation of the module. \nWith high levels of competition, the unit will behave as a binary (bistable) digital flip-flop element. There are two deep potential wells, one on each axis. Hence the module performs a winner-take-all choice on the coordinates of its initial state and maintains that choice \"clamped\" and independent of external input. This is the \"digital\" or \"quantized\" mode of operation of a module. We think of one attractor within the unit as representing \"1\" (the right side in figure two) and the other as representing \"0\". 
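The band pass behavior of the phase equations can be sketched numerically. Reducing one phase equation to the phase difference φ between a unit and a single fixed-amplitude input gives dφ/dt = Δω − K sin(φ), where Δω is the frequency detuning and K lumps together the coupling term w^+ I/r; this reduction and all parameter values are illustrative assumptions, not taken from the text above.

```python
import math

def phase_lock(delta_omega, K=1.0, dt=0.001, T=50.0):
    """Integrate the phase-difference equation d(phi)/dt = delta_omega - K*sin(phi),
    obtained from the normal-form phase equation with the amplitude held fixed.
    Returns the net drift of phi over the second half of the run."""
    phi = 2.0                      # arbitrary initial phase difference
    n = int(T / dt)
    drift_start = None
    for step in range(n):
        phi += dt * (delta_omega - K * math.sin(phi))
        if step == n // 2:
            drift_start = phi
    return phi - drift_start

# Within the passband (|delta_omega| < K) the unit phase-locks to its input:
# phi settles at a fixed point and the late-time drift is essentially zero.
locked_drift = phase_lock(delta_omega=0.3)
# Outside the passband the phase difference drifts without bound:
# synchronization is lost, as for a module with a perturbed resonant frequency.
lost_drift = phase_lock(delta_omega=2.0)
```

The locking condition |Δω| < K is exactly the passband described above: detuning a module's resonant frequency beyond it removes the phase-locked attractor.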
\n\n3 Elman Network of Oscillating Associative Memories \n\nAs a benchmark for the capabilities of the system, and to create a point of contact to standard network architectures, we have constructed a discrete-time recurrent \"Elman\" network [Elman, 1991] from oscillatory modules defined by ordinary differential equations. Previously we constructed a system which functions as the six state finite automaton that perfectly recognizes or generates the set of strings defined by the Reber grammar described in Cleeremans et al. [Cleeremans et al., 1989]. We found the connections for this network by using the backpropagation algorithm in a static network that approximates the behavior of the amplitudes of oscillation in a fully synchronized dynamic network [Baird et al., 1993]. \nHere we construct a system that emulates the larger 13 state automaton similar (less one state) to the one studied by Cleeremans et al. in the second part of their paper. The graph of this automaton consists of two subgraph branches, each of which has the graph structure of the automaton learned as above, but with different assignments of transition output symbols (see fig. 1). \n\nFigure 1: Graph of the thirteen state automaton, consisting of two subgraph branches. \n\nWe use two types of modules in implementing the Elman network architecture shown in figure two below. The input and output layer each consist of a single associative memory module with six oscillatory attractors (six competing oscillatory modes), one for each of the six symbols in the grammar. The hidden and context layers consist of the binary \"units\" above composed of two oscillatory attractors. The architecture consists of 14 binary modules in the hidden and context layers - three of which are special frequency control modules. 
The hidden and context layers are divided into four groups: the first three correspond to each of the two subgraphs plus the start state, and the fourth group consists of three special control modules, each of which has only a special control output that perturbs the resonant frequencies of the modules (by changing their values in the program) of a particular state coding group when it is at the zero attractor, as illustrated by the dotted control lines in figure two. This figure shows control unit two at the one attractor (right side of the square active) and the hidden units coding for states of subgraph two in synchrony with the input and output modules. Activity levels oscillate up and down through the plane of the paper. Here in midcycle, competition is high in all modules. \n\nFigure 2: Oscillating Elman network, showing the input, output, hidden, and context layers. \n\nThe discrete machine cycle of the Elman algorithm is implemented by the sinusoidal variation (clocking) of the bifurcation parameter in the normal form equations that determines the level of intramodule competition [Baird et al., 1993]. At the beginning of a machine cycle, when a network is generating strings, the input and context layers are at high competition and their activity is clamped at the bottom of deep basins of attraction. The hidden and output modules are at low competition and therefore behave as a traditional feedforward network free to take on analog values. In this analog mode, a real valued error can be defined for the hidden and output units and standard learning algorithms like backpropagation can be used to train the connections. \nThen the situation reverses. 
For a Reber grammar there are always two equally possible next symbols being activated in the output layer, and we let the crosstalk noise break this symmetry so that the winner-take-all dynamics of the output module can choose one. High competition has now also \"quantized\" and clamped the activity in the hidden layer to a fixed binary vector. Meanwhile, competition is lowered in the input and context layers, freeing these modules from their attractors. An identity mapping from hidden to context loads the binarized activity of the hidden layer into the context layer for the next cycle, and an additional identity mapping from the output to input module places the chosen output symbol into the input layer to begin the next cycle. \n\n4 Attentional Control of Synchrony \n\nWe introduce a model of attention as control of program flow by selective synchronization. The attentional controller itself is modeled in this architecture as a special set of three hidden modules with outputs that affect the resonant frequencies of the other corresponding three subsets of hidden modules. Varying levels of intramodule competition control the large scale direction of information flow between layers of the architecture. To direct information flow on a finer scale, the attention mechanism selects a subset of modules within each layer whose output is effective in driving the state transition behavior of the system. \nBy controlling the patterns of synchronization within the network we are able to generate the grammar obtained from an automaton consisting of two subgraphs connected by a single transition state (figure 1). 
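The string-generation loop just described can be caricatured in a few lines of discrete emulation. The transition table below is the standard six-state Reber grammar; the two branches of the paper's automaton carry different symbol assignments which are not reproduced here, so a single table serves both branches, and a seeded random generator stands in for the crosstalk noise that breaks the two-way symbol symmetry.

```python
import random

# Standard six-state Reber-grammar transition table:
# state -> list of (emitted symbol, next state); None marks the exit state.
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
    5: [('E', None)],
}

def generate_string(rng):
    """Emulate one pass of the automaton: the first transition out of the
    start state selects a branch ('attention' = which hidden-unit group is
    synchronized), each two-way choice is resolved by the noise stand-in,
    and the exit emits 'B' to re-entrain the start state for the next string."""
    out, state, attention = [], 0, None
    while state is not None:
        symbol, state = rng.choice(REBER[state])
        if attention is None:
            attention = 0 if symbol == 'T' else 1
        out.append(symbol)
    out.append('B')
    return attention, ''.join(out)

rng = random.Random(1)
samples = [generate_string(rng) for _ in range(5)]
```

Each generated string ends with the exit symbol followed by \"B\", mirroring the feedback path that re-entrains the start state module.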
During training we enforce a segregation of the hidden layer code for the states of the separate subgraph branches of the automaton so that different sets of synchronized modules learn to code for each subgraph of the automaton. Then the entire automaton is hand constructed with an additional hidden module for the start state between the branches. Transitions in the system from states in one subgraph of the automaton to the other are made by \"attending\" to the corresponding set of nodes in the hidden and context layers. This switching of the focus of attention is accomplished by changing the patterns of synchronization within the network, which changes the flow of communication between modules. \nEach control module modulates the intrinsic frequency of the units coding for the states of a single subgraph or the unit representing the start state. The control modules respond to a particular input symbol and context to set the intrinsic frequency of the proper subset of hidden units to be equal to the input layer frequency. As described earlier, modules can easily be desynchronized by perturbing their resonant frequencies. By perturbing the frequencies of the remaining modules away from the input frequency, these modules are no longer communicating with the rest of the network. Thus coherent information flows from input to output only through one of three channels. Viewing the automaton as a behavioral program, the control of synchrony constitutes a control of the program flow into its subprograms (the subgraphs of the automaton). \nWhen either exit state of a subgraph is reached, the \"B\" (begin) symbol is then emitted and fed back to the input, where it is connected through the first to second layer weight matrix to the attention control modules. It turns off the synchrony of the hidden states of the subgraph and allows entrainment of the start state to begin a new string of symbols. 
This state in turn activates both a \"T\" and a \"P\" in the output module. The symbol selected by the crosstalk noise and fed back to the input module is now connected to the control modules through the weight matrix. It desynchronizes the start state module, synchronizes the subset of hidden units coding for the states of the appropriate subgraph, and establishes there the start state pattern for that subgraph. \nFuture work will investigate the possibilities for self-organization of the patterns of synchrony and spatially segregated coding in the hidden layer during learning. The weights for entire automata, including the special attention control hidden units, should be learned at once. \n\n4.1 Acknowledgments \n\nSupported by AFOSR-91-0325, and a grant from LLNL. It is a pleasure to acknowledge the invaluable assistance of Morris Hirsch and Walter Freeman. \n\nReferences \n\n[Baird, 1990] Baird, B. (1990). Bifurcation and learning in network models of oscillating cortex. In Forrest, S., editor, Emergent Computation, pages 365-384. North Holland. Also in Physica D, 42. \n\n[Baird and Eeckman, 1993] Baird, B. and Eeckman, F. H. (1993). A normal form projection algorithm for associative memory. In Hassoun, M. H., editor, Associative Neural Memories: Theory and Implementation, New York, NY. Oxford University Press. \n\n[Baird et al., 1993] Baird, B., Troyer, T., and Eeckman, F. H. (1993). Synchronization and grammatical inference in an oscillating Elman network. In Hanson, S., Cowan, J., and Giles, C., editors, Advances in Neural Information Processing Systems 5, pages 236-244. Morgan Kaufmann. \n\n[Bressler and Nakamura, 1993] Bressler, S. and Nakamura (1993). Interarea synchronization in Macaque neocortex during a visual discrimination task. In Eeckman, F. H. and Bower, J., editors, Computation and Neural Systems, page 515. 
\nKluwer. \n\n[Cleeremans et al., 1989] Cleeremans, A., Servan-Schreiber, D., and McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3):372-381. \n\n[Elman, 1991] Elman, J. (1991). Distributed representations, simple recurrent networks and grammatical structure. Machine Learning, 7(2/3):91. \n\n[Freeman and Baird, 1987] Freeman, W. and Baird, B. (1987). Relation of olfactory EEG to behavior: Spatial analysis. Behavioral Neuroscience, 101:393-408. \n\n[Gray and Singer, 1987] Gray, C. M. and Singer, W. (1987). Stimulus dependent neuronal oscillations in the cat visual cortex area 17. Neuroscience [Suppl], 22:1301P. \n\n[Guckenheimer and Holmes, 1983] Guckenheimer, J. and Holmes, P. (1983). Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, New York. \n\n\fADAPTIVE KNOT PLACEMENT FOR NONPARAMETRIC REGRESSION \n\nHossein L. Najafi* \nDepartment of Computer Science \nUniversity of Wisconsin \nRiver Falls, WI 54022 \n\nVladimir Cherkassky \nDepartment of Electrical Engineering \nUniversity of Minnesota \nMinneapolis, Minnesota 55455 \n\nAbstract \n\nPerformance of many nonparametric methods critically depends on the strategy for positioning knots along the regression surface. The Constrained Topological Mapping algorithm is a novel method that achieves adaptive knot placement by using a neural network based on Kohonen's self-organizing maps. We present a modification to the original algorithm that provides knot placement according to the estimated second derivative of the regression surface. \n\n1 INTRODUCTION \n\nHere we consider regression problems. Using mathematical notation, we seek to find a function f of N − 1 predictor variables (denoted by vector X) from a given set of n data points, or measurements, Z_i = (X_i, Y_i) (i = 1, ... 
, n) in N-dimensional sample space: \n\nY = f(X) + error \n\n(1) \n\nwhere error is unknown (but zero mean) and its distribution may depend on X. The distribution of points in the training set can be arbitrary, but uniform distribution in the domain of X is often used. \n\n* Responsible for correspondence. Telephone (715) 425-3769, e-mail hossein.najafi@uwrf.edu. \n\n247 \n\n\f248 \n\nNajafi and Cherkassky \n\nThe goal of this paper is to show how statistical considerations can be used to improve the performance of a novel neural network algorithm for regression [CN91], in order to achieve adaptive positioning of knots along the regression surface. By estimating and employing the second derivative of the underlying function, the modified algorithm is made more flexible around the regions with large second derivative. Through empirical investigation, we show that this modified algorithm allocates more units around the regions where the second derivative is large. This increase in the local knot density introduces more flexibility into the model (around the regions with large second derivative) and makes the model less biased around these regions. However, no over-fitting is observed around these regions. \n\n2 THE PROBLEM OF KNOT LOCATION \n\nOne of the most challenging problems in practical implementations of adaptive methods for regression is adaptive positioning of knots along the regression surface. Typically, knot positions in the domain of X are chosen as a subset of the training data set, or knots are uniformly distributed in X. Once X-locations are fixed, commonly used data-driven methods can be applied to determine the number of knots. However, de Boor [dB78] showed that a polynomial spline with unequally spaced knots can approximate an arbitrary function much better than a spline with equally spaced knots. 
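de Boor's observation is easy to reproduce with a toy experiment. The sketch below uses piecewise-linear interpolation as a stand-in for a spline fit; the target function (a Gaussian bump, so the curvature is concentrated near one point) and the particular knot positions are illustrative choices, not taken from the paper.

```python
import math

def f(x):
    # target with curvature concentrated near x = 0.5
    return math.exp(-(x - 0.5) ** 2 / 0.002)

def interp(knots, x):
    """Piecewise-linear interpolant through the points (k, f(k))."""
    for a, b in zip(knots, knots[1:]):
        if a <= x <= b:
            t = (x - a) / (b - a)
            return (1 - t) * f(a) + t * f(b)
    raise ValueError("x outside the knot range")

def rmse(knots, n=1000):
    pts = [i / n for i in range(n + 1)]
    return math.sqrt(sum((interp(knots, x) - f(x)) ** 2 for x in pts) / len(pts))

uniform = [i / 10 for i in range(11)]               # 11 equally spaced knots
adaptive = [0.0, 0.2, 0.35, 0.42, 0.46, 0.5,        # same count, clustered
            0.54, 0.58, 0.65, 0.8, 1.0]             # where |f''| is large

err_uniform, err_adaptive = rmse(uniform), rmse(adaptive)
```

With the same number of knots, the error of the adaptively placed set is noticeably lower, which is the motivation for data-driven knot positioning.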
Unfortunately, the minimization problem involved in determination of the optimal placement of knots is highly nonlinear and the solution space is not convex [FS89]. Hence, the performance of many recent algorithms that include adaptive knot placement (e.g. MARS) is difficult to evaluate analytically. In addition, it is well-known that when data points are uniform, more knots should be located where the second derivative of the function is large. However, it is difficult to extend these results for non-uniform data in conjunction with data-dependent noise. Also, estimating the second derivative of a true function is necessary for optimal knot placement. Yet, the function itself is unknown and its estimation depends on the good placement of knots. This suggests the need for some iterative procedure that alternates between function estimation (smoothing) and knot positioning steps. \n\nMany ANN methods effectively try to solve the problem of adaptive knot location using ad hoc strategies that are not statistically optimal. For example, local adaptive methods [Che92] are generalizations of kernel smoothers where the kernel functions and kernel centers are determined from the data by some adaptive algorithm. Examples of local adaptive methods include several recently proposed ANN models known as radial basis function (RBF) networks, regularization networks, networks with locally tuned units, etc. [BL88, MD89, PG90]. When applied to regression problems, all these methods seek to find a regression estimate in the (most general) form Σ_{i=1..k} b_i H_i(X, C_i), where X is the vector of predictor variables, C_i is the coordinates of the i-th 'center' or 'bump', H_i is the response function of the kernel type (the kernel width may be different for each center i), b_i are linear coefficients to be determined, and k is the total number of knots or 'centers'. 
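To make the general form concrete, the sketch below fixes the centers C_i in advance, takes identical Gaussian kernels H_i, and solves only for the linear coefficients b_i through the normal equations; all names, parameter values, and the one-dimensional setting are illustrative assumptions.

```python
import math

def gaussian(x, c, width=0.15):
    return math.exp(-((x - c) / width) ** 2)

def fit_rbf(xs, ys, centers):
    """Least-squares estimate of b in y ~ sum_i b_i H_i(x, c_i) with the
    centers fixed: solve the normal equations (G'G) b = G'y, where
    G[n][i] = H_i(x_n, c_i), by Gaussian elimination with partial pivoting."""
    G = [[gaussian(x, c) for c in centers] for x in xs]
    k = len(centers)
    A = [[sum(g[i] * g[j] for g in G) for j in range(k)] for i in range(k)]
    rhs = [sum(g[i] * y for g, y in zip(G, ys)) for i in range(k)]
    for col in range(k):                          # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, k):
            m = A[r][col] / A[col][col]
            for c2 in range(col, k):
                A[r][c2] -= m * A[col][c2]
            rhs[r] -= m * rhs[col]
    b = [0.0] * k                                 # back substitution
    for r in range(k - 1, -1, -1):
        b[r] = (rhs[r] - sum(A[r][c2] * b[c2] for c2 in range(r + 1, k))) / A[r][r]
    return b

def predict(x, centers, b):
    return sum(bi * gaussian(x, c) for bi, c in zip(b, centers))

# Sanity check: recover known coefficients from data generated by the same basis.
centers = [0.2, 0.5, 0.8]
b_true = [1.0, -2.0, 0.5]
xs = [i / 49 for i in range(50)]
ys = [predict(x, centers, b_true) for x in xs]
b_hat = fit_rbf(xs, ys, centers)
```

Because the centers are fixed, the fit reduces to linear least squares, which is exactly why the center-location problem discussed next is the hard part.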
\n\nWhereas the general formulation above assumes global optimization of an error measure for the training set with respect to all parameters, i.e. center locations, kernel widths, and linear coefficients, this is not practically feasible because the error surface is generally non-convex and may have local minima [PG90, MD89]. Hence most practical approaches first solve the problem of center (knot) location and assume identical kernel functions. Then the remaining problem of finding linear coefficients b_i is solved by using familiar methods of Linear Algebra [PG90] or gradient-descent techniques [MD89]. It appears that the problem of center locations is the most critical one for the local neural network techniques. Unfortunately, heuristics used for center location are not based on any statistical considerations, and empirical results are too sketchy [PG90, MD89]. In statistical methods knot locations are typically viewed as free parameters of the model, and hence the number of knots directly controls the model complexity. Alternatively, one can impose local regularization constraints on adjacent knot locations, so that neighboring knots cannot move independently. Such an approach is effectively implemented in the model of self-organization known as Kohonen's Self-Organizing Maps (SOM) [Koh84]. This model uses a set of units (\"knots\") with neighborhood relations between units defined according to a fixed topological structure (typically a 1D or 2D grid). During training or self-organization, data points are presented to the map iteratively, one at a time, and the unit closest to the data moves towards it, also pulling along its topological neighbors. 
\n\n3 MODIFIED CTM ALGORITHM FOR ADAPTIVE \n\nKNOT PLACEMENT \n\nThe SOM model has been applied to nonparametric regression by Cherkassky and \nNajafi [CN9I] in order to achieve adaptive positioning of knots along the regres(cid:173)\nsion surface. Their technique, called Constrained Topological Mapping (CTM), is a \nmodification of Kohonen's self-organization suitable for regression problems. CTM \ninterprets the units of the Kohonen map as movable knots of a regression surface. \nCorrespondingly, the problem of finding regression estimate can be stated as the \nproblem of forming an M - dimensional topological map using a set of samples \nfrom N-dimensional sample space (where AI ~ N - 1) . Unfortunately, straight(cid:173)\nforward application of the Kohonen Algorithm to regression problem does not work \nwell [CN9I]. Because, the presence of noise in the training data can fool the algo(cid:173)\nrithm to produce a map that is a multiple-valued function of independent variables \nin the regression problem (1). This problem is overcome in the CTM algorithm, \nwhere the nearest neighbor is found in the subspace of predictor variables, rather \nthan in the input(sample) space [CN9I]. \n\nWe present next a concise description of the CTM algorithm. Using standard for(cid:173)\nmulation (1) for regression, the training data are N-dimensional vectors Zi = (Xi \n, Yi), where Y i is a noisy observation of an unknown function of N - 1 predictor \nvariables given by vector Xi. The CTM algorithm constructs an M - dimensional \ntopological map in N-dimensional sample space (M ~ N - 1) as follows: \n\no. Initialize the M - dimensional t.opological map in N-dimensional sample \n\nspace. \n\n1. Given an input vector Z in N-dimensional sample space, find the closest \n\n(best matching) unit i in the subspace of independent val\u00b7iables: \n\nII Z*(k) - Wi II = Minj{IIZ* - W; II} \n\nVj E [I, ... 
,L] \n\n\f250 \n\nNajafi and Cherkassky \n\nwhere Z\u00b7 is the projection of the input vector onto the subspace of inde(cid:173)\npendent variables, Wi is the projection of the weight vector of unit j, and \nk is the discrete time step. \n\n2. Adjust the units' weights according to the following and return to 1: \n\n(2) \nwhere /3( k) is the learning rate and Cj (k) is the neighborhood for unit i at \niteration k and are given by: \n\n'Vi \n\n(k: .. ) \n\n/3(k) = /30 x (~~) \n\n,Cj(k) = -----~~ \n\n(3) \n\n1 \n\no 5 ( \n\nIIi - ill ) \n/3(k) x So \n\nexp' \n\nwhere kmax is the final value of the time step (k max is equal to the product of \nthe training set size by the number of times it was recycled), /30 is the initial \nlearning rate, and /3/ is the final learning rate (/30 = 1.0 and /3/ = 0.05 were \nused in all of our experiments), Iii - ill is the topological distance between \nthe unit i and the best matched unit i and So is the initial size of the map \n(i.e., the number of units per dimension) . \n\nNote that CTM method achieves placement of units (knots) in X-space according \nto density of training data. This is due to the fact that X-coordinates of CTM units \nduring training follow the standard Kohonen self-organization algorithm [Koh84], \nwhich is known to achieve faithful approximation of an unknown distribution. How(cid:173)\never, existing CTM method does not place more knots where the underlying function \nchanges rapidly. The improved strategy for CTM knot placement in X-space takes \ninto account estimated second derivative of a function as is described next. \n\nThe problem with estimating second derivative is that the function itself is unknown. \nThis suggests using an iterative strategy for building a model, i.e., start with a crude \nmodel, estimate the second derivative based on this crude model, use the estimated \nsecond derivative to refine the model, etc. 
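A minimal sketch of one CTM iteration (steps 1-2, with the learning rate and neighborhood of equations (2)-(3)) for a 1-D map in 2-D sample space; this is our own illustration of the published update rule, using the β₀ and β_f defaults quoted in the text:

```python
import numpy as np

def ctm_step(W, z, k, k_max, beta0=1.0, beta_f=0.05, s0=10.0):
    # W: (L, N) knot vectors; the last coordinate is the response Y, the
    # first N-1 coordinates are the predictors X. z: (N,) training sample.
    beta = beta0 * (beta_f / beta0) ** (k / k_max)            # equation (3)
    # step 1: best match is found in the subspace of predictors only
    i = int(np.argmin(((W[:, :-1] - z[:-1]) ** 2).sum(axis=1)))
    idx = np.arange(len(W))
    C = np.exp(-0.5 * ((idx - i) / (beta * s0)) ** 2)         # equation (3)
    # step 2: every unit moves toward the full sample vector,  equation (2)
    return W + beta * C[:, None] * (z - W)
```

Matching in X-space only is what keeps the trained map a single-valued function of the predictors.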
This strategy can be easily incorporated into the CTM algorithm due to its iterative nature. Specifically, in the CTM method the map of knots (i.e., the model) becomes closer and closer to the final regression model as the training proceeds. Therefore, at each iteration, the modified algorithm estimates the second derivative at the best matching unit (closest to the presented data point in X-space), and allows additional movement of knots proportional to this estimate. Estimating the second derivative from the map (instead of using the training data) makes sense due to the smoothing properties of CTM.

The modified CTM algorithm can be summarized as follows:

1. Present training sample Zi = (Xi, Yi) to the map and find the closest (best matching) unit i in the subspace of independent variables to this data point (same as in the original CTM).

2. Move the map (i.e., the best matching unit and all its neighbors) toward the presented data point (same as in the original CTM).

3. Estimate the average second derivative of the function at the best matching unit based on the current positions of the map units.

4. Normalize this average second derivative to the interval [0, 1].

5. Move the map toward the presented data point at a rate proportional to the estimated normalized average second derivative, and iterate.

For multivariate functions, only gradients along directions given by the topological structure of the map can be estimated in step 3. For example, given a 2-dimensional mesh that approximates a function f(x1, x2), every unit of the map (except the border units, which have only one neighbor) has two neighboring units along each topological dimension. These neighboring units can be used to approximate the function's gradients along the corresponding topological dimension of the map.
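For the univariate case, the second-derivative estimate in step 3 might look like the following finite-difference sketch over three consecutive map units (our illustration; the paper does not give an explicit formula):

```python
import numpy as np

def second_derivative_at(W, i):
    # W: (L, 2) unit positions (x, y) of a 1-D map; i: an interior unit.
    # Uses the two topological neighbors of unit i.
    (x0, y0), (x1, y1), (x2, y2) = W[i - 1], W[i], W[i + 1]
    s_left = (y1 - y0) / (x1 - x0)     # slope of the left map segment
    s_right = (y2 - y1) / (x2 - x1)    # slope of the right map segment
    # change of slope per unit x: a second difference on a nonuniform grid
    return 2.0 * (s_right - s_left) / (x2 - x0)
```

On a 2-D mesh the same estimate would be formed separately along each topological dimension.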
\nThese values along each dimension can then be averaged to provide a local gradient \nestimate at a given knot. \nIn step 5, estimated average second derivative I\" is normalized to [0,1] range using \n1/Ji = 1 - exp(lf\"ll tan(T)) This is done because the value of second derivative is used \nas the learning rate. \n\nIn step 6, the map is modified according to the following equation: \n\n'Vj \n\n(4) \n\nIt is this second movement of the map that allows for more flexibility around the \nregion of the map where the second derivative is large. The process described by \nequation (4) is equivalent to pulling all units towards the data, with the learning \nrate proportional to estimated second derivative at the best matched unit. Note \nthat the influence of the second derivative is gradually increased during the process \nof self-organization by the factor (1-,B( k)). This factor account for the fact that the \nmap becomes closer and closer to the underlying function during self-orga.nization; \nhence, providing a more reliable estimate of second deriva.tive. \n\n4 EMPIRICAL COMPARISON \n\nPerformance of the two algorithms (original and modified CTM) was compared for \nseveral low-dimensional problems. In all experiments the two algorithms used the \nsame training set of 100 data points for the univariate problems and 400 data points \nfor the 2-variable problems. \nThe training samples (Xi, Yi) were generated according to (1), with Xi randomly \ndrawn from a uniform distribution in the closed interval [-1,1]' and the error drawn \nfrom the normal distribution N(O, (0.1)2). Regression estimates produced by the \nself-organized maps were tested on a different set of n = 200 samples (test set) \ngenerated in the same manner as the training set. \n\nWe used the Average Residual, AR = j ~ L~=l [Yi - I(Xd]2, as the performance \nmeasure on the test set. 
Here, f̂(X) is the piecewise linear estimate of the function with knot locations provided by the coordinates of the units of the trained CTM. The Average Residual gives an indication of the standard deviation of the overall generalization error.

Figure 1: A 50 unit map formed by the original and modified algorithm for the Gaussian function.

Figure 2: A 50 unit map formed by the original and modified algorithm for the step function.

We used a Gaussian function (f(x) = exp(−64x²)) and a step function for our first set of experiments. Figures 1 and 2 show the actual maps formed by the original and modified algorithms for these functions. It is clear from these figures that the modified algorithm allocates more units around the regions where the second derivative is large. This increase in the local knot density has introduced more flexibility into the model around the regions with large second derivatives. As a result, the model is less biased around these regions. However, the increased knot density does not cause over-fitting in the regions where the second derivative is large.

Figure 3: Average Residual error as a function of the size of the map for the 3-dimensional Step function.

Figure 4: Average Residual error as a function of the size of the map for the 3-dimensional Sine function.

To compare the behavior of the two algorithms on structureless data, we trained them on a constant function f(x) = a with error drawn from N(0, (0.1)²). This problem is known as smoothing pure noise in regression analysis. It has been shown [CN91] that the original algorithm handles this problem well and that the quality of CTM smoothing is independent of the number of units in the map. Our experiments show that the modified algorithm performs as well as the original one in this respect.

Finally, we used the following two-variable functions (step and sine) to see how well the modified algorithm performs in higher dimensional settings.

Step: f(x1, x2) = 1 if ((x1 < 0.5) ∧ (x2 < 0.5)) ∨ ((x1 ≥ 0.5) ∧ (x2 ≥ 0.5)), and 0 otherwise.

Sine: f(x1, x2) = sin( 2π √(x1² + x2²) )

The results of these experiments are summarized in Figures 3 and 4. Again we see that the modified algorithm outperforms the original algorithm. Note that the above example of a two-variable step function can be easily handled by recursive partitioning techniques such as CART [BFOS84].
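For reference, the two test surfaces and the sampling scheme described above might be reproduced as follows (a sketch; the random seed, and applying the 0.5 thresholds over the stated [−1, 1] sampling interval, are our assumptions):

```python
import numpy as np

def f_step(x1, x2):
    # two-variable step function from the experiments
    hi = ((x1 < 0.5) & (x2 < 0.5)) | ((x1 >= 0.5) & (x2 >= 0.5))
    return np.where(hi, 1.0, 0.0)

def f_sine(x1, x2):
    # two-variable sine function from the experiments
    return np.sin(2.0 * np.pi * np.sqrt(x1 ** 2 + x2 ** 2))

def make_training_set(f, n=400, noise=0.1, lo=-1.0, hi=1.0, seed=0):
    # uniform predictors with additive N(0, noise^2) errors, as in (1)
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n, 2))
    y = f(X[:, 0], X[:, 1]) + rng.normal(0.0, noise, size=n)
    return X, y
```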
However, recursive methods are sensitive to coordinate rotation. On the other hand, CTM is a coordinate-independent method, i.e., its performance is independent of any affine transformation in X-space.

References

[BFOS84] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[BL88] D.S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.

[Che92] V. Cherkassky. Neural networks and nonparametric regression. In S.Y. Kung, F. Fallside, J.Aa. Sorenson, and C.A. Kamm, editors, Neural Networks for Signal Processing, volume II. IEEE, Piscataway, NJ, 1992.

[CN91] V. Cherkassky and H.L. Najafi. Constrained topological mapping for nonparametric regression analysis. Neural Networks, 4:27-40, 1991.

[dB78] C. de Boor. A Practical Guide to Splines. Springer-Verlag, 1978.

[FS89] J.H. Friedman and B.W. Silverman. Flexible parsimonious smoothing and additive modeling. Technometrics, 31(1):3-21, 1989.

[Koh84] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, third edition, 1984.

[MD89] J. Moody and C.J. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, 1:281, 1989.

[PG90] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481-1497, 1990.