{"title": "A Framework for the Cooperation of Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 781, "page_last": 788, "abstract": null, "full_text": "A Framework for the Cooperation of Learning Algorithms \n\nL\u00e9on Bottou \n\nPatrick Gallinari \n\nLaboratoire de Recherche en Informatique \n\nUniversit\u00e9 de Paris XI \n91405 Orsay Cedex \n\nFrance \n\nAbstract \n\nWe introduce a framework for training architectures composed of several \nmodules. This framework, which uses a statistical formulation of learning \nsystems, provides a single formalism for describing many classical \nconnectionist algorithms as well as complex systems in which several \nalgorithms interact. It makes it possible to design hybrid systems that combine \nthe advantages of connectionist algorithms with those of other learning algorithms. \n\n1 INTRODUCTION \n\nMany recent achievements in the connectionist area have come from designing \nsystems in which different algorithms interact. For example, Bourlard & Morgan (1991) \nhave combined a Multi-Layer Perceptron (MLP) with a Dynamic Programming algorithm. \nAnother impressive application (Le Cun, Boser & al., 1990) uses a very complex multi-layer \narchitecture, followed by a statistical decision process. Also, in speech or image \nrecognition systems, input signals are processed sequentially through different modules. \nModular systems are the most promising way to achieve such complex tasks. They can \nbe built from simple components and can therefore be easily modified or extended. They \nalso make it possible to incorporate into the architecture structural a priori knowledge about \nthe task decomposition. Of course, this is also true for connectionism, and important \nprogress in this field could be achieved if we were able to train multi-module \narchitectures. 
 \n\nIn this paper, we introduce a formal framework for designing and training such \ncooperative systems. It provides a single formalism for describing both the different \nmodules and the global system. We show that it is suitable for many connectionist \nalgorithms, which makes it possible to have them cooperate in an optimal way with respect to the \ngoal of learning. It also makes it possible to train hybrid systems in which connectionist and classical \nalgorithms interact. Our formulation is based on a probabilistic approach to the problem \nof learning, which is described in section 2. One advantage of this approach is that it \nprovides a formal definition of the goal of learning. In section 3, we introduce modular \narchitectures in which each module can be described using this framework, and we derive \nexplicit formulas for training the global system through a stochastic gradient descent \nalgorithm. Section 4 is devoted to examples, including the case of hybrid algorithms \ncombining MLP and Learning Vector Quantization (Bollivier, Gallinari & Thiria, 1990). \n\n2 LEARNING SYSTEMS \n\nThe probabilistic formulation of the problem of learning has been extensively studied for \nthree decades (Tsypkin 1971) and applied to control, pattern recognition and adaptive \nsignal processing. We recall here the main ideas and refer to (Tsypkin 1971) for a detailed \npresentation. \n\n2.1 EXPECTED COST \n\nLet x be an instance of the concept to learn. In a pattern recognition problem, \nfor example, x would be a pair (pattern, class). The concept is mathematically defined by \nan unknown probability density function p(x) which measures the likelihood of instance \nx. \n\nWe shall use a system parameterized by w to perform some task that depends on p(x). \nGiven an example x, we can define a local cost, J(x,w), that measures how well our \nsystem behaves on that example. 
For instance, for classification, J would be zero if the \nsystem puts a pattern in the correct class, and positive in case of misclassification. \n\nLearning consists in finding a parameter w* that optimises some functional of the model \nparameters. For instance, one would like to minimize the expected cost (1). \n\n$C(w) = \\int J(x,w) \\, p(x) \\, dx$ (1) \n\nThe expected cost cannot be computed explicitly, because the density p(x) is unknown. \nOur only knowledge of the process comes from a series of observations {x_1 ... x_n} drawn \nfrom the unknown density p(x). Therefore, the quality of our system can only be \nmeasured through the realisations J(x,w) of the local cost function for the different \nobservations. \n\n2.2 STOCHASTIC GRADIENT DESCENT \n\nGradient descent algorithms are the simplest minimization algorithms. We cannot, \nhowever, compute the gradient of the expected cost (1), because p(x) is unknown. \nEstimating these derivatives on a training set {x_1 ... x_n} gives the gradient algorithm (2), \nwhere $\\nabla J$ denotes the gradient of J(x,w) with respect to w, and $\\epsilon_t$ is a small positive \nconstant, the \"learning rate\". \n\n$w_{t+1} = w_t - \\epsilon_t \\frac{1}{n} \\sum_{i=1}^{n} \\nabla J(x_i, w_t)$ (2) \n\nThe stochastic gradient descent algorithm (3) is an alternative to algorithm (2). At each \niteration, an example x_t is drawn at random, and a new value of w is computed. \n\nAlgorithm (3) is faster and more reliable than (2); it is the only solution for training \nadaptive systems such as neural networks (NN). Such stochastic approximations have been \nextensively studied in adaptive signal processing (Benveniste, Metivier & Priouret, 1987; \nLjung & S\u00f6derstr\u00f6m, 1983). Under certain conditions, algorithm (3) converges almost \nsurely (Bottou, 1991; White, 1991) and makes it possible to reach an optimal state of the system. 
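The difference between the averaged gradient update (2) and the one-example-at-a-time update of stochastic gradient descent can be illustrated with a minimal sketch. It assumes a hypothetical scalar local cost J(x, w) = 0.5 * (w - x)**2, whose expected cost is minimized at the mean of x, and a decreasing learning rate 1/(t+1) for the stochastic version. The function names, the toy data and the learning-rate schedule are illustrative assumptions, not taken from the paper.

```python
import random

def grad_J(x, w):
    # Gradient with respect to w of the hypothetical local cost
    # J(x, w) = 0.5 * (w - x)**2
    return w - x

def batch_gradient(xs, w, eps, steps):
    # Algorithm (2): each update averages the gradient over the training set
    for _ in range(steps):
        g = sum(grad_J(x, w) for x in xs) / len(xs)
        w = w - eps * g
    return w

def stochastic_gradient(xs, w, steps, seed=0):
    # Stochastic variant: each update uses one example drawn at random
    rng = random.Random(seed)
    for t in range(steps):
        x_t = rng.choice(xs)
        eps_t = 1.0 / (t + 1)  # decreasing learning rate (an assumption)
        w = w - eps_t * grad_J(x_t, w)
    return w

xs = [1.0, 2.0, 3.0, 4.0]          # toy training set, mean 2.5
w_batch = batch_gradient(xs, w=0.0, eps=0.5, steps=50)
w_sgd = stochastic_gradient(xs, w=0.0, steps=2000)
```

With this particular schedule the stochastic iterate is the running mean of the drawn examples, so both procedures approach the minimizer 2.5; the stochastic one does so with some residual noise.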
 \n\n$w_{t+1} = w_t - \\epsilon_t \\nabla J(x_t, w_t)$ (3) \n\n3 MODULAR LEARNING SYSTEMS \n\nWhen the goal of learning is complex, it can most often be achieved more easily by \ndecomposing the global task into several simpler subtasks which, for instance, \nreflect some a priori knowledge about the structure of the problem. One can use this \ndecomposition to build modular architectures in which each module corresponds to one of \nthe subtasks. \n\nWithin this framework, we will use the expected cost (1) as the goal of learning. The \nproblem now is to change the analytical formulation of the functional (1) so as to \nintroduce the modular decomposition of the global task. In (1), the analytic expression of \nthe local cost J(x,w) plays two roles: it describes a parametric relationship between the \ninputs and the outputs of the system, and it measures the quality of the system. To \nintroduce the decomposition, one may write this local cost J(x,w) as the composition of \nseveral functions. One of them takes the local error into account and therefore measures \nthe quality of the system; the others correspond to the decomposition of the \nparametric relationship between the inputs and the outputs of the system (Figure 1). Each \nof the modules therefore receives some inputs from other modules or from the external \nworld and produces some outputs which are sent to other modules. \n\nFigure 1: A modular system \n\nIn classical systems, these modules correspond to well defined processing stages such as \nsignal processing, filtering, feature extraction and classification. They are trained sequentially \nand then linked together to build a complete processing system which takes some inputs \n(e.g. raw signals) and produces some outputs (e.g. classes). Neither the assumed \ndecomposition, 
nor the behavior of the different modules is guaranteed to contribute optimally \nto the global goal of learning. We will show in the following that it is \npossible to train such systems optimally. \n\n3.1 TRAINING MODULAR SYSTEMS \n\nEach function in the above composition defines a local processing stage or module whose \noutputs are defined by a parametric function of its inputs (4). \n\n$\\forall j \\in Y^{-1}(n), \\quad y_j = f_j\\big( (x_k)_{k \\in X^{-1}(n)}, (w_i)_{i \\in W^{-1}(n)} \\big)$ (4) \n\n$Y^{-1}(n)$ (resp. $X^{-1}(n)$ and $W^{-1}(n)$) denotes the set of subscripts associated with the outputs \ny (resp. inputs x and parameters w) of module n. Conversely, output $y_j$ (resp. input $x_k$ \nand parameter $w_i$) belongs to module Y(j) (resp. X(k) and W(i)). \n\nModules are linked so as to build a feed-forward topology which is expressed by a \nfunction $\\phi$. \n\n$\\forall k, \\quad x_k = y_{\\phi(k)}$ (5) \n\nWe shall consider that the first module only feeds the system with examples and that the \nlast module only computes $y_{last} = J(x,w)$. \n\nFollowing (Le Cun, 1988), we can compute the derivatives of J with a Lagrangian \nmethod. Let $\\alpha$ and $\\beta$ be the Lagrange coefficients for constraints (4) and (5). \n\n$L = J - \\sum_k \\beta_k \\big( x_k - y_{\\phi(k)} \\big) - \\sum_j \\alpha_j \\big( y_j - f_j( (x_k)_{k \\in X^{-1}Y(j)}, (w_i)_{i \\in W^{-1}Y(j)} ) \\big)$ (6) \n\nBy equating the derivatives of L with respect to x and y to zero, we get recursive formulas \nfor computing $\\alpha$ and $\\beta$ in a single backward pass along the acyclic graph $\\phi$. \n\n$\\alpha_{last} = 1, \\quad \\alpha_j = \\sum_{k \\in \\phi^{-1}(j)} \\beta_k, \\quad \\beta_k = \\sum_{j \\in Y^{-1}X(k)} \\alpha_j \\frac{\\partial f_j}{\\partial x_k}$ (7) \n\nThen, the derivatives of J with respect to the weights are: \n\n$\\frac{\\partial J}{\\partial w_i}(w) = \\frac{\\partial L}{\\partial w_i}(\\alpha, \\beta, w) = \\sum_{j \\in Y^{-1}W(i)} \\alpha_j \\frac{\\partial f_j}{\\partial w_i}$ (8) \n\nOnce we have computed the derivatives of the local cost J(x,w), we can apply the \nstochastic gradient descent algorithm (3) to minimize the expected cost C(w). 
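The forward computation and the single backward pass described above can be sketched for the simplest topology, a chain of modules in which each module feeds only the next one, so that the beta values of a module are the alpha values of its predecessor. The sketch below assumes three modules taken from the table of Section 4.1 (matrix product, sigmoid, mean square error). The class names, method names and numerical values are hypothetical, and the finite-difference comparison at the end is only a sanity check, not part of the method.

```python
import math

class MatrixProduct:
    def __init__(self, w):
        self.w = [row[:] for row in w]
    def forward(self, x):
        # Forward: y_i = sum_k w_ik x_k
        self.x = x
        return [sum(wik * xk for wik, xk in zip(row, x)) for row in self.w]
    def backward(self, alpha):
        # Gradient: delta_ik = alpha_i x_k
        self.delta = [[a * xk for xk in self.x] for a in alpha]
        # Backward: beta_k = sum_i alpha_i w_ik
        return [sum(a * row[k] for a, row in zip(alpha, self.w))
                for k in range(len(self.x))]

class Sigmoid:
    def forward(self, x):
        self.y = [1.0 / (1.0 + math.exp(-xk)) for xk in x]
        return self.y
    def backward(self, alpha):
        # Backward: beta_k = f'(x_k) alpha_k, with f'(x) = f(x) (1 - f(x))
        return [yk * (1.0 - yk) * a for yk, a in zip(self.y, alpha)]

class MeanSquareError:
    def __init__(self, d):
        self.d = d  # desired outputs
    def forward(self, x):
        self.x = x
        return sum((dk - xk) ** 2 for dk, xk in zip(self.d, x))
    def backward(self, alpha_last):
        # Backward: beta_k = -2 (d_k - x_k), scaled by alpha_last = 1
        return [-2.0 * alpha_last * (dk - xk) for dk, xk in zip(self.d, self.x)]

def cost(modules, x):
    # Forward pass: the output of the last module is the local cost J
    for m in modules:
        x = m.forward(x)
    return x

def backward_pass(modules):
    # Backward pass: start from alpha_last = 1 and walk the chain in reverse
    alpha = 1.0
    for m in reversed(modules):
        alpha = m.backward(alpha)
    return alpha

layer = MatrixProduct([[0.1, -0.2], [0.3, 0.4]])
modules = [layer, Sigmoid(), MeanSquareError([1.0, 0.0])]
x = [0.5, -1.0]
J = cost(modules, x)
backward_pass(modules)
analytic = layer.delta[0][1]

# Sanity check: central finite difference on one weight
h = 1e-6
layer.w[0][1] += h
J_plus = cost(modules, x)
layer.w[0][1] -= 2 * h
J_minus = cost(modules, x)
layer.w[0][1] += h
numeric = (J_plus - J_minus) / (2 * h)
```

A stochastic gradient step would then subtract the learning rate times `layer.delta` from `layer.w`. Modules without parameters, such as the sigmoid and the cost, only need the forward and backward parts.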
 \n\nWe shall say that each module is defined by the equations in (4), (7) and (8) that characterize \nits behavior. These equations are: \n\n\u2022 a forward equation (F): $y_j = f_j\\big( (x_k)_{k \\in X^{-1}(n)}, (w_i)_{i \\in W^{-1}(n)} \\big)$ \n\n\u2022 a backward equation (B): $\\beta_k = \\sum_{j \\in Y^{-1}X(k)} \\alpha_j \\frac{\\partial f_j}{\\partial x_k}$ \n\n\u2022 a gradient equation (G): $\\Delta_i = \\frac{\\partial J}{\\partial w_i} = \\sum_{j \\in Y^{-1}W(i)} \\alpha_j \\frac{\\partial f_j}{\\partial w_i}$ \n\nThe remaining equations do not depend on the nature of the modules. They describe how \nmodules interact during training. Like back-propagation, they address the credit \nassignment problem between modules by globally minimizing a single cost function. \nTraining such a complex system actually consists in cooperatively training its \ncomponents. \n\n4 EXAMPLES \n\nMost learning algorithms, as well as new algorithms, may be expressed as modular \nlearning systems. Here are some simple examples of modules and systems. \n\n4.1 LINEAR AND QUASI-LINEAR SYSTEMS \n\nMODULE | SYMBOL | FORWARD | BACKWARD | GRADIENT \nMatrix product | Wx | $y_i = \\sum_k w_{ik} x_k$ | $\\beta_k = \\sum_i \\alpha_i w_{ik}$ | $\\Delta_{ik} = \\alpha_i x_k$ \nMean square error | MSE | $J = \\sum_k (d_k - x_k)^2$ | $\\beta_k = -2 (d_k - x_k)$ | - \nPerceptron error | Perceptron | $J = -\\sum_k (d_k - 1_{\\mathbb{R}^+}(x_k)) x_k$ | $\\beta_k = -(d_k - 1_{\\mathbb{R}^+}(x_k))$ | - \nSigmoid | sigmoid | $y_k = f(x_k)$ | $\\beta_k = f'(x_k) \\alpha_k$ | - \n\nA few basic modules are defined in the above table. Figure 2 gives examples of linear and \nquasi-linear algorithms derived by combining these modules. \n\n[Figure 2 diagrams: Examples -> Wx -> MSE -> J; Examples -> Wx -> Perceptron -> J; Examples -> Wx -> sigmoid -> Wx -> sigmoid -> MSE -> J] \n\nFigure 2: An Adaline, a Perceptron, and a 2-Layer Perceptron. \n\nSome MLP architectures, Time Delay Networks for instance, use local connections and \nshared weights. 
Such complex architectures may be constructed by defining either quasi-linear \nunit modules or complex matrix operation modules such as convolutions. The latter \nsolution leads to more efficient implementations. Figure 3 gives an example of a \nconvolution module, composed of several matrix product modules. \n\nFigure 3: A convolution module, composed of several matrix product modules. \n\n4.2 EUCLIDEAN DISTANCE BASED ALGORITHMS \n\nA wide class of learning systems is based on the measure of Euclidean distances. Again, \ndefining a Euclidean distance module and some adequate cost functions makes it possible to \nhandle most Euclidean distance based algorithms. Here are some examples: \n\nMODULE | SYMBOL | FORWARD | BACKWARD | GRADIENT \nEuclidean distance | (x-w)^2 | $y_j = \\sum_k (x_k - w_{jk})^2$ | $\\beta_k = -2 \\sum_j \\alpha_j (w_{jk} - x_k)$ | $\\Delta_{jk} = 2 \\alpha_j (w_{jk} - x_k)$ \nMinimum | Min | $J = x_{k^*} = \\min_k \\{x_k\\}$ | $\\beta_{k^*} = 1, \\; \\beta_{k \\neq k^*} = 0$ | - \nLVQ1 error | LVQ1 | if the nearest reference $x_{k^*}$ is associated with the correct class: $J = x_{k^*} = \\min_k \\{x_k\\}$; else: $J = -x_{k^*} = -\\min_k \\{x_k\\}$ | correct class: $\\beta_{k^*} = 1, \\; \\beta_{k \\neq k^*} = 0$; else: $\\beta_{k^*} = -1, \\; \\beta_{k \\neq k^*} = 0$ | - \n\nCombining a Euclidean distance module with a \"minimum\" error module gives a K-means \nalgorithm; combining it with an LVQ1 error module gives the LVQ1 algorithm \n(Figure 4). \n\nFigure 4: K-Means and Learning Vector Quantization. \n\n4.3 HYBRID ALGORITHMS \n\nHybrid algorithms, which may combine classical and connectionist learning algorithms, are \neasily defined by chaining appropriate modules. Figure 5, for instance, depicts an \nalgorithm combining an MLP layer and LVQ1. This algorithm has been described and \nempirically compared to other pattern recognition algorithms in (Bollivier, Gallinari & \nThiria, 1990). 
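The Euclidean distance and LVQ1 error modules of Section 4.2, which the hybrid system reuses, can be sketched along the same forward, backward and gradient lines. The following is a minimal illustration under stated assumptions: the class names, the class labels, the learning rate and the toy reference vectors are hypothetical, and the forward rule of the distance module is taken to be the squared distance suggested by its symbol (x-w)^2.

```python
class EuclideanDistance:
    def __init__(self, w):
        self.w = [row[:] for row in w]  # one reference vector per row
    def forward(self, x):
        # Forward: y_j = sum_k (x_k - w_jk)**2
        self.x = x
        return [sum((xk - wjk) ** 2 for xk, wjk in zip(x, row))
                for row in self.w]
    def backward(self, alpha):
        # Gradient: delta_jk = 2 alpha_j (w_jk - x_k)
        self.delta = [[2.0 * a * (wjk - xk) for wjk, xk in zip(row, self.x)]
                      for a, row in zip(alpha, self.w)]
        # Backward: beta_k = -2 sum_j alpha_j (w_jk - x_k)
        return [-2.0 * sum(a * (row[k] - self.x[k])
                           for a, row in zip(alpha, self.w))
                for k in range(len(self.x))]

class LVQ1Error:
    def __init__(self, ref_classes):
        self.ref_classes = ref_classes  # class label of each reference
    def forward(self, dist, target):
        # J = +min distance if the nearest reference has the correct class,
        # J = -min distance otherwise
        self.n = len(dist)
        self.k_star = min(range(self.n), key=dist.__getitem__)
        correct = self.ref_classes[self.k_star] == target
        self.sign = 1.0 if correct else -1.0
        return self.sign * dist[self.k_star]
    def backward(self):
        # beta is +1 or -1 for the nearest reference and 0 for the others
        return [self.sign if k == self.k_star else 0.0
                for k in range(self.n)]

def lvq1_step(dist_module, error_module, x, target, eps):
    # One stochastic gradient step on the reference vectors
    J = error_module.forward(dist_module.forward(x), target)
    dist_module.backward(error_module.backward())
    for j, row in enumerate(dist_module.w):
        for k in range(len(row)):
            row[k] -= eps * dist_module.delta[j][k]
    return J

dist_module = EuclideanDistance([[0.0, 0.0], [1.0, 1.0]])
error_module = LVQ1Error(['a', 'b'])
J = lvq1_step(dist_module, error_module, x=[0.2, 0.0], target='a', eps=0.25)
```

In this run the nearest reference carries the correct class, so the step pulls it toward the example and leaves the other reference unchanged; with a wrong class the sign flips and the nearest reference is pushed away. In a hybrid chain, the beta values returned by `EuclideanDistance.backward` would be passed on as alpha values to the preceding sigmoid and matrix product modules.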
 \n\n[Figure 5 diagram: Examples -> Wx -> sigmoid -> (x-w)^2 -> LVQ1 -> J] \n\nFigure 5: A hybrid algorithm combining an MLP and LVQ. \n\nCooperative training gives a framework and a possible implementation for such \nalgorithms. Nevertheless, there are still specific problems (e.g. convergence, \ninitialization) which require careful study. More complex hybrid systems, including \ncombinations of Markov Models and Time Delay Networks, have been described within \nthis framework in (Bottou, 1991). \n\n5 CONCLUSION \n\nCooperative training of modular systems provides a unified view of many learning \nalgorithms, as well as of hybrid systems which combine classical and connectionist \nalgorithms. Our formalism provides a way to define specific modules and to combine \nthem into a cooperative system. This makes it possible to design and implement complex learning \nsystems which can incorporate structural a priori knowledge about the task. \n\nAcknowledgements \n\nDuring this work, L.B. was supported by DRET grant no. 87/808/19. \n\nReferences \n\nBenveniste A., Metivier M., Priouret P. (1987) Algorithmes adaptatifs et approximations \nstochastiques. Masson \n\nBollivier M. de, Gallinari P. & Thiria S. (1990) Cooperation of neural nets for robust \nclassification. Proceedings of IJCNN 90, San Diego, vol. 1, 113-120. \n\nBottou L. (1991) Une approche th\u00e9orique de l'apprentissage connexionniste; applications \n\u00e0 la reconnaissance de la parole. PhD Thesis. Universit\u00e9 de Paris XI \n\nBourlard H., Morgan N. (1991) A Continuous Speech Recognition System Embedding \nMLP into HMM. In Touretzky D.S., Lippmann R. (eds.) Advances in Neural Information \nProcessing Systems 3 (this volume). Morgan Kaufmann \n\nLe Cun Y. (1988) A theoretical framework for back-propagation. In Touretzky D., \nHinton G. & Sejnowski T. (eds.) 
Proceedings of the 1988 Connectionist Models \nSummer School, 21-28. Morgan Kaufmann \n\nLe Cun Y., Boser B., & al. (1990) Handwritten Digit Recognition with a Back-Propagation \nNetwork. In Touretzky D. (ed.) Advances in Neural Information Processing \nSystems 2, 396-404. Morgan Kaufmann \n\nLjung L. & S\u00f6derstr\u00f6m T. (1983) Theory and Practice of Recursive Identification. MIT \nPress \n\nTsypkin Ya. (1971) Adaptation and Learning in Automatic Systems. Mathematics in \nScience and Engineering, vol. 73. Academic Press \n\nWhite H. (1991) An Overview of Representation and Convergence Results for Multilayer \nFeed-forward Networks. In Touretzky D.S., Lippmann R. (eds.) Advances in Neural \nInformation Processing Systems 3 (this volume). Morgan Kaufmann \n", "award": [], "sourceid": 308, "authors": [{"given_name": "L\u00e9on", "family_name": "Bottou", "institution": null}, {"given_name": "Patrick", "family_name": "Gallinari", "institution": null}]}