{"title": "The Efficient Learning of Multiple Task Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 258, "abstract": null, "full_text": "The Efficient Learning of Multiple Task Sequences\n\nSatinder P. Singh\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\nAbstract\n\nI present a modular network architecture and a learning algorithm based on incremental dynamic programming that allows a single learning agent to learn to solve multiple Markovian decision tasks (MDTs) with significant transfer of learning across the tasks. I consider a class of MDTs, called composite tasks, formed by temporally concatenating a number of simpler, elemental MDTs. The architecture is trained on a set of composite and elemental MDTs. The temporal structure of a composite task is assumed to be unknown and the architecture learns to produce a temporal decomposition. It is shown that under certain conditions the solution of a composite MDT can be constructed by computationally inexpensive modifications of the solutions of its constituent elemental MDTs.\n\n1  INTRODUCTION\n\nMost applications of domain independent learning algorithms have focussed on learning single tasks. Building more sophisticated learning agents that operate in complex environments will require handling multiple tasks/goals (Singh, 1992). Research effort on the scaling problem has concentrated on discovering faster learning algorithms, and while that will certainly help, techniques that allow transfer of learning across tasks will be indispensable for building autonomous learning agents that have to learn to solve multiple tasks. 
In this paper I consider a learning agent that interacts with an external, finite-state, discrete-time, stochastic dynamical environment and faces multiple sequences of Markovian decision tasks (MDTs).\n\nEach MDT requires the agent to execute a sequence of actions to control the environment, either to bring it to a desired state or to traverse a desired state trajectory over time. Let $S$ be the finite set of states and $A$ be the finite set of actions available to the agent.\u00b9 At each time step $t$, the agent observes the system's current state $x_t \\in S$ and executes action $a_t \\in A$. As a result, the agent receives a payoff with expected value $R(x_t, a_t) \\in \\mathbb{R}$ and the system makes a transition to state $x_{t+1} \\in S$ with probability $P_{x_t x_{t+1}}(a_t)$. The agent's goal is to learn an optimal closed loop control policy, i.e., a function assigning actions to states, that maximizes the agent's objective. The objective used in this paper is $J = \\sum_{t=0}^{\\infty} \\gamma^t R(x_t, a_t)$, i.e., the discounted sum of the payoffs over an infinite horizon. The discount factor, $0 \\le \\gamma \\le 1$, allows future payoff to be weighted less than more immediate payoff. Throughout this paper, I will assume that the learning agent does not have access to a model of the environment. Reinforcement learning algorithms such as Sutton's (1988) temporal difference algorithm and Watkins's (1989) Q-learning algorithm can be used to learn to solve single MDTs (also see Barto et al., 1991).\n\nI consider compositionally-structured MDTs because they allow the possibility of sharing knowledge across the many tasks that have common subtasks. In general, there may be $n$ elemental MDTs labeled $T_1, T_2, \\ldots, T_n$. Elemental MDTs cannot be decomposed into simpler subtasks. 
Composite MDTs, labeled $C_1, C_2, \\ldots, C_m$, are produced by temporally concatenating a number of elemental MDTs. For example, $C_j = [T(j,1)\\,T(j,2) \\cdots T(j,k)]$ is composite task $j$ made up of $k$ elemental tasks that have to be performed in the order listed. For $1 \\le i \\le k$, $T(j,i) \\in \\{T_1, T_2, \\ldots, T_n\\}$ is the $i$th elemental task in the list for task $C_j$. The sequence of elemental tasks in a composite task will be referred to as the decomposition of the composite task; the decomposition is assumed to be unknown to the learning agent.\n\nCompositional learning involves solving a composite task by learning to compose the solutions of the elemental tasks in its decomposition. It is to be emphasized that given the short-term, evaluative nature of the payoff from the environment (often the agent gets informative payoff only at the completion of the composite task), the task of discovering the decomposition of a composite task is formidable. In this paper I propose a compositional learning scheme in which separate modules learn to solve the elemental tasks, and a task-sensitive gating module solves composite tasks by learning to compose the appropriate elemental modules over time.\n\n2  ELEMENTAL AND COMPOSITE TASKS\n\nAll elemental tasks are MDTs that share the same state set $S$, action set $A$, and have the same environment dynamics. The payoff function for each elemental task $T_i$, $1 \\le i \\le n$, is $R_i(x, a) = \\sum_{y \\in S} P_{xy}(a) r_i(y) - c(x, a)$, where $r_i(y)$ is a positive reward associated with the state $y$ resulting from executing action $a$ in state $x$ for task $T_i$, and $c(x, a)$ is the positive cost of executing action $a$ in state $x$. I assume that $r_i(x) = 0$ if $x$ is not the desired final state for $T_i$. 
Thus, the elemental tasks share the same cost function but have their own reward functions.\n\nA composite task is not itself an MDT because the payoff is a function of both the state and the current elemental task, instead of the state alone. Formally, the new state set\u00b2 for a composite task, $S'$, is formed by augmenting the elements of set $S$ by $n$ bits, one for each elemental task. For each $x' \\in S'$, the projected state $x \\in S$ is defined as the state obtained by removing the augmenting bits from $x'$. The environment dynamics and cost function, $c$, for a composite task are defined by assigning to each $x' \\in S'$ and $a \\in A$ the transition probabilities and cost assigned to the projected state $x \\in S$ and $a \\in A$. The reward function for composite task $C_j$, $r_j$, is defined as follows. $r_j(x') \\ge 0$ if the following are all true: i) the projected state $x$ is the final state for some elemental task in the decomposition of $C_j$, say task $T_i$, ii) the augmenting bits of $x'$ corresponding to elemental tasks appearing before and including subtask $T_i$ in the decomposition of $C_j$ are one, and iii) the rest of the augmenting bits are zero; $r_j(x') = 0$ everywhere else.\n\n\u00b9 The extension to the case where different sets of actions are available in different states is straightforward.\n\n3  COMPOSITIONAL Q-LEARNING\n\nFollowing Watkins (1989), I define the Q-value, $Q(x, a)$, for $x \\in S$ and $a \\in A$, as the expected return on taking action $a$ in state $x$ under the condition that an optimal policy is followed thereafter. Given the Q-values, a greedy policy that in each state selects an action with the highest associated Q-value is optimal. Q-learning works as follows. 
On executing action $a$ in state $x$ at time $t$, the resulting payoff and next state are used to update the estimate of the Q-value at time $t$, $Q_t(x, a)$:\n\n$$Q_{t+1}(x, a) = (1.0 - \\alpha_t) Q_t(x, a) + \\alpha_t \\left[ R(x, a) + \\gamma \\max_{a' \\in A} Q_t(y, a') \\right], \\qquad (1)$$\n\nwhere $y$ is the state at time $t + 1$, and $\\alpha_t$ is the value of a positive learning rate parameter at time $t$. Watkins and Dayan (1992) prove that under certain conditions on the sequence $\\{\\alpha_t\\}$, if every state-action pair is updated infinitely often using Equation 1, $Q_t$ converges to the true Q-values asymptotically.\n\nCompositional Q-learning (CQ-learning) is a method for constructing the Q-values of a composite task from the Q-values of the elemental tasks in its decomposition. Let $Q_{T_i}(x, a)$ be the Q-value of $(x, a)$, $x \\in S$ and $a \\in A$, for elemental task $T_i$, and let $Q^{C_j}_{T_i}(x', a)$ be the Q-value of $(x', a)$, for $x' \\in S'$ and $a \\in A$, for task $T_i$ when performed as part of the composite task $C_j = [T(j,1) \\cdots T(j,k)]$. Assume $T_i = T(j, l)$. Note that the superscript on $Q$ refers to the task and the subscript refers to the elemental task currently being performed. The absence of a superscript implies that the task is elemental.\n\n\u00b2 The theory developed in this paper does not depend on the particular extension of $S$ chosen, as long as the appropriate connection between the new states and the elements of $S$ can be made.\n\nConsider a set of undiscounted ($\\gamma = 1$) MDTs that have compositional structure and satisfy the following conditions:\n(A1) Each elemental task has a single desired final state.\n(A2) For all elemental and composite tasks, the expected value of undiscounted return for an optimal policy is bounded both from above and below for all states.\n(A3) The cost associated with each state-action pair is independent of the task being accomplished. 
\n(A4) For each elemental task $T_i$, the reward function $r_i$ is zero for all states except the desired final state for that task. For each composite task $C_j$, the reward function $r_j$ is zero for all states except possibly the final states of the elemental tasks in its decomposition (Section 2).\n\nThen, for any elemental task $T_i$ and for all composite tasks $C_j$ containing elemental task $T_i$, the following holds:\n\n$$Q^{C_j}_{T_i}(x', a) = Q_{T_i}(x, a) + K(C_j, T(j, l)), \\qquad (2)$$\n\nfor all $x' \\in S'$ and $a \\in A$, where $x \\in S$ is the projected state, and $K(C_j, T(j, l))$ is a function of the composite task $C_j$ and subtask $T(j, l)$, where $T_i = T(j, l)$. Note that $K(C_j, T(j, l))$ is independent of the state and the action. Thus, given solutions of the elemental tasks, learning the solution of a composite task with $n$ elemental tasks requires learning only the values of the function $K$ for the $n$ different subtasks. A proof of Equation 2 is given in Singh (1992).\n\nFigure 1: The CQ-Learning Architecture (CQ-L). This figure is adapted from Jacobs et al. (1991). See text for details.\n\nEquation 2 is based on the assumption that the decomposition of the composite tasks is known. In the next section, I present a modular architecture and learning algorithm that simultaneously discovers the decomposition of a composite task and implements Equation 2.\n\n4  CQ-L: CQ-LEARNING ARCHITECTURE\n\nJacobs (1991) developed a modular connectionist architecture that performs task decomposition. Jacobs's gating architecture consists of several expert networks and a gating network that has an output for each expert network. 
The architecture has been used to learn multiple non-sequential tasks within the supervised learning paradigm. I extend the modular network architecture to a CQ-Learning architecture (Figure 1), called CQ-L, that can learn multiple compositionally-structured sequential tasks even when training information required for supervised learning is not available. CQ-L combines CQ-learning and the gating architecture to achieve transfer of learning by \"sharing\" the solutions of elemental tasks across multiple composite tasks. Only a very brief description of the CQ-L is provided in this paper; details are given in Singh (1992).\n\nTable 1: Tasks. Tasks $T_1$, $T_2$, and $T_3$ are elemental tasks; tasks $C_1$, $C_2$, and $C_3$ are composite tasks. The last column describes the compositional structure of the tasks.\n\nLabel  Command  Description                 Decomposition\nT1     000001   Visit A                     T1\nT2     000010   Visit B                     T2\nT3     000100   Visit C                     T3\nC1     001000   Visit A and then C          T1 T3\nC2     010000   Visit B and then C          T2 T3\nC3     100000   Visit A, then B and then C  T1 T2 T3\n\nIn CQ-L the expert networks are Q-learning networks that learn to approximate the Q-values for the elemental tasks. The Q-networks receive as input both the current state and the current action. The gating and bias networks (Figure 1) receive as input the augmenting bits and the task command used to encode the current task being performed by the architecture. The stochastic switch in Figure 1 selects one Q-network at each time step. CQ-L's output, $Q$, is the output of the selected Q-network added to the output of the bias network. 
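\n\nAs a reading aid, the output computation just described can be written compactly. This is only a sketch: the index $j$ for the Q-network chosen by the stochastic switch is notation assumed here for illustration, with $q_j$ the output of the $j$th Q-network and $b$ the output of the bias network.\n\n$$Q(t) = q_j(t) + b(t).$$\n\nSet beside Equation 2, this sum suggests one plausible reading of the architecture: the selected Q-network supplies the elemental-task term $Q_{T_i}(x, a)$, while the bias network, which sees only the augmenting bits and the task command, supplies a state- and action-independent offset playing the role of $K(C_j, T(j, l))$. 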
\n\nThe learning rules used to train the network perform gradient ascent in the log of the likelihood, $L(t)$, of generating the estimate of the desired Q-value at time $t$, denoted $D(t)$, and are given below:\n\n$$q_j(t+1) = q_j(t) + \\alpha_Q \\frac{\\partial \\log L(t)}{\\partial q_j(t)},$$\n$$s_i(t+1) = s_i(t) + \\alpha_g \\frac{\\partial \\log L(t)}{\\partial s_i(t)}, \\text{ and}$$\n$$b(t+1) = b(t) + \\alpha_b (D(t) - Q(t)),$$\n\nwhere $q_j$ is the output of the $j$th Q-network, $s_i$ is the $i$th output of the gating network, $b$ is the output of the bias network, and $\\alpha_Q$, $\\alpha_b$ and $\\alpha_g$ are learning rate parameters. The backpropagation algorithm (e.g., Rumelhart et al., 1986) was used to update the weights in the networks. See Singh (1992) for details.\n\n5  NAVIGATION TASK\n\nTo illustrate the utility of CQ-L, I use a navigational test bed similar to the one used by Bachrach (1991) that simulates a planar robot that can translate simultaneously and independently in both $x$ and $y$ directions. It can move one radius in any direction on each time step. The robot has 8 distance sensors and 8 gray-scale sensors evenly placed around its perimeter. These 16 values constitute the state vector. Figure 2 shows a display created by the navigation simulator. The bottom portion of the figure shows the robot's environment as seen from above. The upper panel shows the robot's state vector. Three different goal locations, A, B, and C, are marked on the test bed. The set of tasks on which the robot is trained is shown in Table 1. The elemental tasks require the robot to go to the given goal location from a random starting location in minimum time. The composite tasks require the robot to go to a goal location via a designated sequence of subgoal locations.\n\nFigure 2: Navigation Testbed. See text for details. 
\n\nTask commands were represented by standard unit basis vectors (Table 1), and thus the architecture could not \"parse\" the task command to determine the decomposition of a composite task. Each Q-network was a feedforward connectionist network with a single hidden layer containing 128 radial basis units. The bias and gating networks were also feedforward nets with a single hidden layer containing sigmoid units. For all $x \\in S \\cup S'$ and $a \\in A$, $c(x, a) = -0.05$. $r_i(x) = 1.0$ only if $x$ is the desired final state of elemental task $T_i$, or if $x \\in S'$ is the final state of composite task $C_i$; $r_i(x) = 0.0$ in all other states. Thus, for composite tasks no intermediate payoff for successful completion of subtasks was provided.\n\n6  SIMULATION RESULTS\n\nIn the simulation described below, the performance of CQ-L is compared to the performance of a \"one-for-one\" architecture that implements the \"learn-each-task-separately\" strategy. The one-for-one architecture has a pre-assigned distinct network for each task, which prevents transfer of learning. Each network of the one-for-one architecture was provided with the augmented state.\n\nFigure 3: Learning Curves for Multiple tasks. [Three panels, each comparing CQ-L and ONE-FOR-ONE; horizontal axes: Trial Number (for Task A), Trial Number (for Task [AB]), Trial Number (for Task [ABC]).] 
\n\nBoth CQ-L and the one-for-one architecture were separately trained on the six tasks $T_1$, $T_2$, $T_3$, $C_1$, $C_2$, and $C_3$ until they could perform the six tasks optimally. CQ-L contained three Q-networks, and the one-for-one architecture contained six Q-networks. For each trial, the starting state of the robot and the task identity were chosen randomly. A trial ended when the robot reached the desired final state or when there was a time-out. The time-out period was 100 for the elemental tasks, 200 for $C_1$ and $C_2$, and 500 for task $C_3$. The graphs in Figure 3 show the number of actions executed per trial. Separate statistics were accumulated for each task.\n\nThe leftmost graph shows the performance of the two architectures on elemental task $T_1$. Not surprisingly, the one-for-one architecture performs better because it does not have the overhead of figuring out which Q-network to train for task $T_1$. The middle graph shows the performance on task $C_1$ and shows that the CQ-L architecture is able to perform better than the one-for-one architecture for a composite task containing just two elemental tasks. The rightmost graph shows the results for composite task $C_3$ and illustrates the main point of this paper. The one-for-one architecture is unable to learn the task; in fact, it is unable to perform the task more than a couple of times due to the low probability of randomly performing the correct task sequence.\n\nThis simulation shows that CQ-L is able to learn the decomposition of a composite task and that compositional learning, due to transfer of training across tasks, can be faster than learning each composite task separately. More importantly, CQ-L is able to learn to solve composite tasks that cannot be solved using traditional schemes. 
\n\n7  DISCUSSION\n\nLearning to solve MDTs with large state sets is difficult due to the sparseness of the evaluative information and the low probability that a randomly selected sequence of actions will be optimal. Learning the long sequences of actions required to solve such tasks can be accelerated considerably if the agent has prior knowledge of useful subsequences. Such subsequences can be learned through experience in learning to solve other tasks. In this paper, I define a class of MDTs, called composite MDTs, that are structured as the temporal concatenation of simpler MDTs, called elemental MDTs. I present CQ-L, an architecture that combines the Q-learning algorithm of Watkins (1989) and the modular architecture of Jacobs et al. (1991) to achieve transfer of learning by sharing the solutions of elemental tasks across multiple composite tasks. Given a set of composite and elemental MDTs, the sequence in which the learning agent receives training experiences on the different tasks determines the relative advantage of CQ-L over other architectures that learn the tasks separately. The simulation reported in Section 6 demonstrates that it is possible to train CQ-L on intermixed trials of elemental and composite tasks. Nevertheless, the ability of CQ-L to scale well to complex sets of tasks will depend on the choice of the training sequence.\n\nAcknowledgements\n\nThis work was supported by the Air Force Office of Scientific Research, Bolling AFB, under Grant AFOSR-89-0526 and by the National Science Foundation under Grant ECS-8912623. I am very grateful to Andrew Barto for his extensive help in formulating these ideas and preparing this paper.\n\nReferences\n\nJ. R. Bachrach. (1991) A connectionist learning control architecture for navigation. In R. P. Lippmann, J. E. Moody, and D. S. 
Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 457-463, San Mateo, CA. Morgan Kaufmann.\nA. G. Barto, S. J. Bradtke, and S. P. Singh. (1991) Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, University of Massachusetts, Amherst, MA. Submitted to AI Journal.\nR. A. Jacobs. (1990) Task decomposition through competition in a modular connectionist architecture. PhD thesis, COINS dept, Univ. of Massachusetts, Amherst, Mass. U.S.A.\nR. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. (1991) Adaptive mixtures of local experts. Neural Computation, 3(1).\nD. E. Rumelhart, G. E. Hinton, and R. J. Williams. (1986) Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations. Bradford Books/MIT Press, Cambridge, MA.\nS. P. Singh. (1992) Transfer of learning by composing solutions for elemental sequential tasks. Machine Learning.\nR. S. Sutton. (1988) Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44.\nC. J. C. H. Watkins. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge Univ., Cambridge, England.\nC. J. C. H. Watkins and P. Dayan. (1992) Q-learning. Machine Learning.\n", "award": [], "sourceid": 569, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}]}