{"title": "A competitive modular connectionist architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 767, "page_last": 773, "abstract": null, "full_text": "A competitive modular connectionist architecture \n\nRobert A. Jacobs and Michael I. Jordan \n\nDepartment of Brain & Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nAbstract \n\nWe describe a multi-network, or modular, connectionist architecture that captures the fact that many tasks have structure at a level of granularity intermediate to that assumed by local and global function approximation schemes. The main innovation of the architecture is that it combines associative and competitive learning in order to learn task decompositions. A task decomposition is discovered by forcing the networks comprising the architecture to compete to learn the training patterns. As a result of the competition, different networks learn different training patterns and, thus, learn to partition the input space. The performance of the architecture on a \"what\" and \"where\" vision task and on a multi-payload robotics task is presented. \n\n1 INTRODUCTION \n\nA dichotomy has arisen in recent years in the literature on nonlinear network learning rules between local approximation of functions and global approximation of functions. Local approximation, as exemplified by lookup tables, nearest-neighbor algorithms, and networks with units having local receptive fields, has the advantage of requiring relatively few learning trials and tends to yield interpretable representations. Global approximation, as exemplified by polynomial regression and fully-connected networks with sigmoidal units, has the advantage of requiring less storage capacity than local approximators and may yield superior generalization. 
\n\nIn this paper, we report a multi-network, or modular, connectionist architecture that captures the fact that many tasks have structure at a level of granularity intermediate to that assumed by local and global approximation schemes. It does so by combining the desirable features of the approaches embodied by these disparate approximation schemes. In particular, it uses different networks to learn training patterns from different regions of the input space. Each network can itself be a local or global approximator for a particular region of the space. \n\n[Figure 1: A Modular Connectionist Architecture. Expert networks 1 and 2 and a gating network; the architecture's output is y = g_1 y_1 + g_2 y_2.] \n\n2 A MODULAR CONNECTIONIST ARCHITECTURE \n\nThe technical issues addressed by the modular architecture are twofold: (a) detecting that different training patterns belong to different tasks and (b) allocating different networks to learn the different tasks. These issues are addressed in the architecture by combining aspects of competitive learning and associative learning. Specifically, task decompositions are encouraged by enforcing a competition among the networks comprising the architecture. As a result of the competition, different networks learn different training patterns and, thus, learn to compute different functions. The architecture was first presented in Jacobs, Jordan, Nowlan, and Hinton (1991), and combines earlier work on learning task decompositions in a modular architecture by Jacobs, Jordan, and Barto (1991) with the mixture models view of competitive learning advocated by Nowlan (1990) and Hinton and Nowlan (1990). The architecture is also presented elsewhere in this volume by Nowlan and Hinton (1991). 
\n\nThe architecture, which is illustrated in Figure 1, consists of two types of networks: expert networks and a gating network. The expert networks compete to learn the training patterns and the gating network mediates this competition. Whereas the expert networks have an arbitrary connectivity, the gating network is restricted to have as many output units as there are expert networks, and the activations of these output units must be nonnegative and sum to one. To meet these constraints, we use the \"softmax\" activation function (Bridle, 1989); specifically, the activation of the ith output unit of the gating network, denoted g_i, is \n\ng_i = \\frac{e^{s_i}}{\\sum_{j=1}^{n} e^{s_j}}   (1) \n\nwhere s_i denotes the weighted sum of unit i's inputs and n denotes the number of expert networks. The output of the entire architecture, denoted y, is \n\ny = \\sum_{i=1}^{n} g_i y_i   (2) \n\nwhere y_i denotes the output of the ith expert network. During training, the weights of the expert and gating networks are adjusted simultaneously using the backpropagation algorithm (le Cun, 1985; Parker, 1985; Rumelhart, Hinton, and Williams, 1986; Werbos, 1974) so as to maximize the function \n\n\\ln L = \\ln \\sum_{i=1}^{n} g_i \\, e^{-\\frac{1}{2\\sigma_i^2} \\| y^* - y_i \\|^2}   (3) \n\nwhere y^* denotes the target vector and \\sigma_i denotes a scaling parameter associated with the ith expert network. \n\nThis architecture is best understood if it is given a probabilistic interpretation as an \"associative gaussian mixture model\" (see Duda and Hart (1973) and McLachlan and Basford (1988) for a discussion of non-associative gaussian mixture models). Under this interpretation, the training patterns are assumed to be generated by a number of different probabilistic rules. At each time step, a rule is selected with probability g_i and a training pattern is generated by the rule. 
Each rule is characterized by a statistical model of the form y^* = f_i(x) + \\epsilon_i, where f_i(x) is a fixed nonlinear function of the input vector, denoted x, and \\epsilon_i is a random variable. If it is assumed that \\epsilon_i is gaussian with covariance matrix \\sigma_i^2 I, then the residual vector y^* - y_i is also gaussian and the cost function in Equation 3 is the log likelihood of generating a particular target vector y^*. \n\nThe goal of the architecture is to model the distribution of training patterns. This is achieved by gradient ascent in the log likelihood function. To compute the gradient, consider first the partial derivative of the log likelihood with respect to the weighted sum s_i at the ith output unit of the gating network. Using the chain rule and Equation 1 we find that this derivative is given by: \n\n\\frac{\\partial \\ln L}{\\partial s_i} = g(i \\mid x, y^*) - g_i   (4) \n\nwhere g(i \\mid x, y^*) is the a posteriori probability that the ith expert network generates the target vector: \n\ng(i \\mid x, y^*) = \\frac{g_i \\, e^{-\\frac{1}{2\\sigma_i^2} \\| y^* - y_i \\|^2}}{\\sum_{j=1}^{n} g_j \\, e^{-\\frac{1}{2\\sigma_j^2} \\| y^* - y_j \\|^2}}   (5) \n\nThus the weights of the gating network are adjusted so that the network's outputs, the a priori probabilities g_i, move toward the a posteriori probabilities. \n\nConsider now the gradient of the log likelihood with respect to the output of the ith expert network. Differentiation of \\ln L with respect to y_i yields: \n\n\\frac{\\partial \\ln L}{\\partial y_i} = g(i \\mid x, y^*) \\frac{y^* - y_i}{\\sigma_i^2}   (6) \n\nThese derivatives involve the error term y^* - y_i weighted by the a posteriori probability associated with the ith expert network. Thus the weights of the network are adjusted to correct the error between the output of the ith network and the global target vector, but only in proportion to the a posteriori probability. 
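As an illustrative sketch of Equations 1 through 6 (our own code, not from the paper; the function names and the two-expert example are hypothetical), the gating activations, blended output, posterior probabilities, and gradients can be written in plain Python:

```python
import math

def softmax(s):
    # Gating activations (Equation 1): nonnegative and sum to one.
    exps = [math.exp(v) for v in s]
    total = sum(exps)
    return [e / total for e in exps]

def architecture_output(g, expert_outputs):
    # Blended output (Equation 2): y = sum_i g_i * y_i, component-wise.
    dim = len(expert_outputs[0])
    return [sum(g[i] * expert_outputs[i][d] for i in range(len(g)))
            for d in range(dim)]

def posteriors(g, expert_outputs, target, sigmas):
    # A posteriori probabilities g(i | x, y*) (Equation 5).
    def sq_err(y_i):
        return sum((t - y) ** 2 for t, y in zip(target, y_i))
    unnorm = [g[i] * math.exp(-sq_err(expert_outputs[i]) / (2 * sigmas[i] ** 2))
              for i in range(len(g))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def gradients(g, h, expert_outputs, target, sigmas):
    # dlnL/ds_i = h_i - g_i                       (Equation 4)
    # dlnL/dy_i = h_i * (y* - y_i) / sigma_i^2    (Equation 6)
    ds = [h[i] - g[i] for i in range(len(g))]
    dy = [[h[i] * (t - y) / sigmas[i] ** 2
           for t, y in zip(target, expert_outputs[i])]
          for i in range(len(g))]
    return ds, dy
```

For instance, with two experts where the first expert's output lies closer to the target, the posterior for that expert exceeds its prior, so the gating gradient h - g shifts credit toward it while the expert gradient corrects its residual error in proportion to its posterior.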
For each input vector, typically only one expert network has a large a posteriori probability. Consequently, only one expert network tends to learn each training pattern. In general, different expert networks learn different training patterns and, thus, learn to compute different functions. \n\n3 THE WHAT AND WHERE VISION TASKS \n\nWe applied the modular connectionist architecture to the object recognition task (\"what\" task) and spatial localization task (\"where\" task) studied by Rueckl, Cave, and Kosslyn (1989).1 At each time step of the simulation, one of nine objects is placed at one of nine locations on a simulated retina. The \"what\" task is to identify the object; the \"where\" task is to identify its location. \n\nThe modular architecture is shown in Figure 2. It consists of three expert networks and a gating network. The expert networks receive the retinal image and a task specifier indicating whether the architecture should perform the \"what\" task or the \"where\" task at the current time step. The gating network receives the task specifier. The first expert network contains 36 hidden units, the second expert network contains 18 hidden units, and the third expert network doesn't contain any hidden units (i.e., it is a single-layer network). \n\nThere are at least three ways that this modular architecture might successfully learn the \"what\" and \"where\" tasks. One of the multi-layer expert networks could learn to perform both tasks. Because this solution doesn't show any task decomposition, we consider it to be unsatisfactory. A second possibility is that one of the multi-layer expert networks could learn the \"what\" task, and the other multi-layer expert network could learn the \"where\" task. Although this solution exhibits task decomposition, a shortcoming of this solution is apparent when it is noted that, using the retinal images designed by Rueckl et al. 
(1989), the \"where\" task is linearly separable. This means that the structure of the single-layer expert network most closely matches the \"where\" task. Consequently, a third and possibly best solution would be one in which one of the multi-layer expert networks learns the \"what\" task and the single-layer expert network learns the \"where\" task. This solution would not only show task decomposition but also the appropriate allocation of tasks to expert networks. Simulation results show that the third possible solution is the one that is always achieved. These results provide evidence that the modular architecture is capable of allocating a different network to different tasks and of allocating a network with an appropriate structure to each task. \n\n1 For a detailed presentation of the application of an earlier modular architecture to the \"what\" and \"where\" tasks see Jacobs, Jordan, and Barto (1991). \n\n[Figure 2: The Modular Architecture Applied to the What and Where Tasks] \n\n4 THE MULTI-PAYLOAD ROBOTICS TASK \n\nWhen designing a compensator for a nonlinear plant, control engineers frequently find it impossible or impractical to design a continuous control law that is useful in all the relevant regions of a plant's parameter space. Typically, the solution to this problem is to use gain scheduling; if it is known how the dynamics of a plant change with its operating conditions, then it may be possible to design a piecewise controller that employs different control laws when the plant is operating under different conditions. 
From our viewpoint, gain scheduling is an attractive solution because it involves task decomposition. It circumvents the problem of determining a fixed global model of the plant dynamics. Instead, the dynamics are approximated using local models that vary with the plant's operating conditions. \n\nTask decomposition is a useful strategy not only when the control law is designed, but also when it is learned. We suggest that an ideal controller is one that, like gain scheduled controllers, uses local models of the plant dynamics, and like learning controllers, learns useful control laws despite uncertainties about the plant or environment. Because the modular connectionist architecture is capable of both task decomposition and learning, it may be useful in achieving both of these desiderata. \n\nWe applied the modular architecture to the problem of learning a feedforward controller for a robotic arm in a multiple payload task.2 The task is to drive a simulated two-joint robot arm with a variety of payloads, each of a different mass, along a desired trajectory. The architecture is given the payload's identity (e.g., payload A or payload B) but not its mass. \n\nThe modular architecture consisted of six expert networks and a gating network. The expert networks received as input the state of the robot arm and the desired acceleration. The gating network received the payload identity. We also trained a single multi-layer network to perform this task. The learning curves for the two systems are shown in Figure 3. \n\n[Figure 3: Learning Curves for the Multi-Payload Robotics Task. Joint RMSE (radians) versus training epochs for the single network (SN) and the modular architecture (MA).] \n\nThe horizontal axis gives the training time in epochs. 
\nThe vertical axis gives the joint root mean square error in radians. Clearly, the modular architecture learned significantly faster than the single network. Furthermore, the modular architecture learned to perform the task by allocating different expert networks to control the arm with payloads from different mass categories (e.g., light, medium, or heavy payloads). \n\nAcknowledgements \n\nThis research was supported by a postdoctoral fellowship provided to the first author from the McDonnell-Pew Program in Cognitive Neuroscience, by funding provided to the second author from the Siemens Corporation, and by NSF grant IRI-9013991 awarded to both authors. \n\n2 For a detailed presentation of the application of the modular architecture to the multiple payload robotics task see Jacobs and Jordan (1991). \n\nReferences \n\nBridle, J. (1989) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman-Soulie & J. Herault (Eds.), Neuro-computing: Algorithms, Architectures, and Applications. New York: Springer-Verlag. \n\nDuda, R.O. & Hart, P.E. (1973) Pattern Classification and Scene Analysis. New York: John Wiley & Sons. \n\nHinton, G.E. & Nowlan, S.J. (1990) The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Computation, 2, 355-362. \n\nJacobs, R.A. & Jordan, M.I. (1991) Learning piecewise control strategies in a modular connectionist architecture. Submitted to IEEE Transactions on Neural Networks. \n\nJacobs, R.A., Jordan, M.I., & Barto, A.G. (1991) Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science, in press. \n\nJacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991) Adaptive mixtures of local experts. 
Neural Computation, in press. \n\nle Cun, Y. (1985) Une procedure d'apprentissage pour reseau a seuil asymetrique [A learning procedure for an asymmetric threshold network]. Proceedings of Cognitiva, 85, 599-604. \n\nMcLachlan, G.J. & Basford, K.E. (1988) Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker. \n\nNowlan, S.J. (1990) Maximum likelihood competitive learning. In D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann Publishers. \n\nNowlan, S.J. & Hinton, G.E. (1991) Evaluation of an associative mixture architecture on a vowel recognition task. In R.P. Lippmann, J. Moody, & D.S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann Publishers. \n\nParker, D.B. (1985) Learning logic. Technical Report TR-47, Massachusetts Institute of Technology, Cambridge, MA. \n\nRueckl, J.G., Cave, K.R., & Kosslyn, S.M. (1989) Why are \"what\" and \"where\" processed by separate cortical visual systems? A computational investigation. Journal of Cognitive Neuroscience, 1, 171-186. \n\nRumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986) Learning internal representations by error propagation. In D.E. Rumelhart, J.L. McClelland, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: The MIT Press. \n\nWerbos, P.J. (1974) Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, Cambridge, MA. \n", "award": [], "sourceid": 430, "authors": [{"given_name": "Robert", "family_name": "Jacobs", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}