{"title": "Hierarchical Memory-Based Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1047, "page_last": 1053, "abstract": null, "full_text": "Hierarchical Memory-Based Reinforcement Learning \n\nNatalia Hernandez-Gardiol \nArtificial Intelligence Lab \nMassachusetts Institute of Technology \nCambridge, MA 02139 \nnhg@ai.mit.edu \n\nSridhar Mahadevan \nDepartment of Computer Science \nMichigan State University \nEast Lansing, MI 48824 \nmahadeva@cse.msu.edu \n\nAbstract \n\nA key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable-length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. \n\n1 Introduction \n\nReinforcement learning encompasses a class of machine learning problems in which an agent learns from experience as it interacts with its environment. One fundamental challenge faced by reinforcement learning agents in real-world problems is that the state space can be very large, and consequently there may be a long delay before reward is received. Previous work has addressed this issue by breaking down a large task into a hierarchy of subtasks or abstract behaviors [1, 3, 5]. 
\n\nAnother difficult issue is the problem of perceptual aliasing: different real-world states can often generate the same observations. One strategy to deal with perceptual aliasing is to add memory about past percepts. Short-term memory consisting of a linear (or tree-based) sequence of primitive actions and observations has been shown to be a useful strategy [2]. However, considering short-term memory at a flat, uniform resolution of primitive actions would likely scale poorly to tasks with long decision sequences. Thus, just as spatio-temporal abstraction of the state space improves scaling in completely observable environments, for large partially observable environments a similar benefit may result if we consider the space of past experience at variable resolution. Given a task, we want a hierarchical strategy for rapidly bringing to bear past experience that is appropriate to the grain-size of the decisions being considered. \n\n[Figure 1 diagram: panels labeled abstraction level: navigation, abstraction level: traversal, and abstraction level: primitive; legible labels include corner, T-junction, dead end, and D1 to D3.] \n\nFigure 1: This figure illustrates memory-based decision making at two levels in the hierarchy of a navigation task. At each level, each decision point (shown with a star) examines its past experience to find states with similar history (shown with shadows). At the abstract (navigation) level, observations and decisions occur at intersections. 
At the lower (corridor-traversal) level, observations and decisions occur within the corridor. \n\nIn this paper, we show that considering past experience at a variable, task-appropriate resolution can speed up learning and greatly improve performance under perceptual aliasing. The resulting approach, which we call Hierarchical Suffix Memory (HSM), is a general technique for solving large, perceptually aliased tasks. \n\n2 Hierarchical Suffix Memory \n\nBy employing short-term memory over abstract decisions, each of which involves a hierarchy of behaviors, we can apply memory at a more informative level of abstraction. An important side-effect is that the agent can look at a decision point many steps back in time while ignoring the exact sequence of low-level observations and actions that transpired. Figure 1 illustrates the HSM framework. \n\nThe problem of learning under perceptual aliasing can be viewed as discovering an informative sequence of past actions and observations (that is, a history suffix) for a given world state that enables an agent to act optimally in the world. We can think of each situation in which an agent must choose an action (a choice point) as being labeled with a pair [σ, l]: l refers to the abstraction level and σ refers to the history suffix. In the completely observable case, σ has a length of one, and decisions are made based on the current observation. In the partially observable case, we must additionally consider past history when making decisions. In this case, the suffix σ is some sequence of past observations and actions that must be learned. This idea of representing memory as a variable-length suffix derives from work on learning approximations of probabilistic suffix automata [2, 4]. 
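
As a minimal illustration of the [σ, l] labeling described above, the following Python sketch shows how a suffix of length one can alias two choice points while a longer suffix separates them. The list layout (alternating observations and decisions), the function names, and the two example routes are hypothetical, chosen only for illustration; the paper does not prescribe a data structure.

```python
def history_suffix(history, length):
    '''Return the last `length` entries of one level's history as a hashable tuple.'''
    return tuple(history[-length:])


def choice_point_label(history, length, level):
    '''Build the (suffix, level) pair that labels a choice point.'''
    return (history_suffix(history, length), level)


# Hypothetical navigation-level histories: observation, decision, observation.
route_a = ['corner', 'go-forward', 'T-junction']
route_b = ['dead end', 'go-back', 'T-junction']

# With a suffix of length one, only the current observation is used,
# so the two T-junctions receive the same label (perceptual aliasing).
aliased = (choice_point_label(route_a, 1, 'navigation')
           == choice_point_label(route_b, 1, 'navigation'))

# A longer suffix includes the preceding decision and observation,
# which tells the two T-junctions apart.
distinct = (choice_point_label(route_a, 3, 'navigation')
            != choice_point_label(route_b, 3, 'navigation'))

print(aliased, distinct)  # prints: True True
```

In HSM the appropriate suffix length is not fixed in advance as it is here; it is what the short-term memory technique has to discover.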
\n\nHere is the general HSM procedure (including model-free and model-based updates): \n\n1. Given an abstraction level l and choice point s within l: for each potential future decision, d, examine the history at level l to find a set of past choice points that have executed d and whose incoming (suffix) history most closely matches that of the current point. Call this set of instances the \"voting set\" for decision d. \n\n2. Choose d_t as the decision with the highest average discounted sum of reward over the voting set. Occasionally, choose d_t using an exploration strategy. Here, t is the event counter of the current choice point at level l. \n\n3. Execute the decision d_t and record: o_t, the resulting observation; r_t, the reward received; and n_t, the duration of abstract action d_t (measured by the number of primitive environment transitions executed by the abstract action). \nNote that for every environment transition from state s_{i-1} to state s_i with reward r_i and discount γ, we accumulate any reward and update the discount factor: \n\nr_t ← r_t + γ_t r_i \n\nγ_t ← γ γ_t \n\n4. Update the Q-value for the current decision point and for each instance in the voting set using the decision, reward, and duration values recorded along with the instance. \nModel-free: use an SMDP Q-learning update rule (β is the learning rate): \n\nQ_l(s_t, d_t) ← (1 - β) Q_l(s_t, d_t) + β (r_t + γ_t max_d Q_l(s_{t+n_t}, d)) \n\nModel-based: if a state-transition model is being used, a sweep of value iteration can be executed¹. 
Let the state corresponding to the decision point at time t be represented by the suffix s: \n\nQ_l(s, d_t) ← R_l(s, d_t) + Σ_{s'} P_l(s' | s, d_t) γ^{N_{d_t}} V_l(s') \n\nwhere R_l(s, d_t) is the estimated immediate reward from executing decision d_t from the choice point [s, l]; P_l(s' | s, d_t) is the estimated probability that the agent arrives in [s', l] given that it executed d_t from [s, l]; V_l(s') is the utility of the situation [s', l]; and N_{d_t} is the average duration of the transition [s, l] to [s', l] under abstract action d_t. \n\nHSM requires a technique for short-term memory. We implemented the Nearest Sequence Memory (NSM) and Utile Suffix Memory (USM) algorithms proposed by McCallum [2]. NSM records each of its raw experiences as a linear chain. To choose the next action, the agent evaluates the outcomes of the k \"nearest\" neighbors in the experience chain. NSM evaluates the closeness between two states according to the match length of the suffix chain preceding the states. The chain can either be grown indefinitely, or old experiences can be replaced after the chain reaches a maximum length. With NSM, a model-free learning method, HSM uses an SMDP Q-learning rule as described above. USM also records experience in a linear time chain. However, instead of attempting to choose actions based on a greedy history match, USM tries to explicitly determine how much memory is useful for predicting reward. To do this, the agent builds a tree-like structure for state representation online, selectively adding depth to the tree if the additional history distinction helps to predict reward. With USM, which learns a model, HSM updates the Q-values by doing one sweep of value iteration with the leaves of the tree as states. 
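
The procedure can be sketched in Python for the NSM, model-free case. This is a minimal sketch under assumptions that are not taken from the paper: the Instance record, all function names, the starting values r_t = 0 and gamma_t = 1, and a match metric that compares (observation, decision) pairs while walking backward along the chain. It covers step 1 (voting set), the reward accumulation of step 3, and the model-free update of step 4 only, and it is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    '''One recorded choice point at a single abstraction level (hypothetical layout).'''
    obs: str                        # observation on arriving at the choice point
    decision: Optional[str] = None  # abstract decision taken here, once chosen
    q: float = 0.0                  # Q-value estimate stored with the instance


def accumulate(primitive_rewards, gamma):
    '''Step 3: fold primitive rewards into r_t and track the running discount gamma_t.'''
    r_t, gamma_t = 0.0, 1.0  # assumed starting values, not stated in the paper
    for r_i in primitive_rewards:
        r_t += gamma_t * r_i
        gamma_t *= gamma
    return r_t, gamma_t


def match_length(chain, i, j):
    '''Count matching steps walking backward from positions i and j (assumed metric).'''
    if chain[i].obs != chain[j].obs:
        return 0
    n = 1
    i, j = i - 1, j - 1
    while min(i, j) >= 0 and (chain[i].obs, chain[i].decision) == (chain[j].obs, chain[j].decision):
        n += 1
        i, j = i - 1, j - 1
    return n


def voting_set(chain, cur, decision, k):
    '''Step 1: the k past instances that took `decision` and best match the history at cur.'''
    past = [j for j in range(cur) if chain[j].decision == decision]
    past.sort(key=lambda j: match_length(chain, cur, j), reverse=True)
    return past[:k]


def smdp_q_update(q_old, r_t, gamma_t, best_next_q, beta):
    '''Step 4, model-free: SMDP Q-learning update with learning rate beta.'''
    return (1 - beta) * q_old + beta * (r_t + gamma_t * best_next_q)


chain = [
    Instance('corner', 'go-forward'),
    Instance('T-junction', 'go-left'),
    Instance('dead end', 'go-back'),
    Instance('T-junction', 'go-left'),
    Instance('corner', 'go-forward'),
    Instance('T-junction'),  # current choice point, no decision taken yet
]
votes = voting_set(chain, cur=5, decision='go-left', k=1)
r_t, gamma_t = accumulate([-1.0, -1.0, 10.0], gamma=0.9)
q_new = smdp_q_update(0.0, r_t, gamma_t, best_next_q=0.0, beta=0.1)
print(votes, round(r_t, 2), round(gamma_t, 3), round(q_new, 2))
```

In this toy chain, both earlier T-junction instances took go-left, but only the one at index 1 was also preceded by the corner and go-forward step, so it has the longer match and is the single voter when k = 1.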
\n\nFinally, to implement the hierarchy of behaviors, in principle any hierarchical reinforcement learning method may be used. For our implementation, we used the Hierarchy of Abstract Machines (HAM) framework proposed by Parr and Russell [3]. When executed, an abstract machine executes a partial policy and returns control to the caller upon termination. The HAM architecture uses a Q-learning rule modified for SMDPs. \n\n¹ In this context, \"state\" is represented by the history suffix. That is, an instance is in a \"state\" if the instance's incoming history matches the suffix representing the state. In this case, the voting set is exactly the set of instances in the same state as the current choice point s_t. \n\n[Figure 2 image: screenshot of the Nomad 200 robot simulator.] \n\nFigure 2: The corridor environment in the Nomad 200 robot simulator. The goal is the 4-way junction. The robot is shown at the middle T-junction. The robot is equipped with 16 short-range infrared and long-range sonar sensors. The other figures in the environment are obstacles around which the robot must maneuver. \n\n3 The Navigation Task \n\nTo test the HSM framework, we devised a navigation task in a simulated corridor environment (see Figure 2). The task is for the robot to find its way from the start, the center T-junction, to the goal, the four-way junction. 
The robot receives a reward at the goal intersection and a small negative reward for each primitive step taken. \n\nOur primary testbed was a simulated agent using a Nomad 200 robot simulator. This simulated robot is equipped with 20 bumper and 16 sonar and infrared sensors, arranged radially. The dynamics of the simulator are not \"grid world\" dynamics: the Nomad 200 simulator represents continuous, noisy sensor input and the occasional unreliability of actuators. The environment presents significant perceptual ambiguity. Additionally, sensor readings can be noisy; even if the agent is at the goal or an intersection, it might not \"see\" it. Note the size of the robot relative to the environment in Figure 2. \n\nWhat makes the task difficult is that several activities must be executed concurrently. Conceptually, there are two levels to our navigation problem. At the top, most abstract, level is the root task of navigating to the goal. At the lower level is the task of physically traversing the corridors, avoiding obstacles, maintaining alignment with the walls, etc. \n\n4 Implementation of the Learning Agents \n\nIn our experiments, we compared several learning agents: a basic HAM agent, four agents using HSM (each using a different short-term memory technique), and a \"flat\" NSM agent. \nTo build a set of behaviors for hallway navigation, we used a three-level hierarchy. The top abstract level is basically a choice state for choosing a hallway navigation direction (see Figure 3a). In each of the four nominal directions (front, back, left, right), the agent can make one of three observations: wall, open, or unknown. 
The agent must learn to choose among the four abstract machines to reach the next intersection. This top-level machine has control initially, and it regains control at intersections. The second level of the hierarchy contains the machines for traversing the hallway. The traversal behavior is shown in Figure 3b. Each of the four machines at this level executes a reactive strategy for traversing a corridor. Finally, the third level of the hierarchy implements the follow-wall and avoid-obstacle strategies using primitive actions. Both the avoid-obstacle and the follow-wall strategies were themselves trained previously using Q-learning to exploit the power of reuse in the hierarchical framework. \n\nFigure 3: Hierarchical structure of behaviors for hallway navigation. Figure (a) shows the most abstract level, responsible for navigating in the environment. Figures (b) and (c) show two implementations of the hall-traversal machines. The machine in Figure (b) is reactive, and Figure (c) is a machine with a choice point. \n\nThe HAM agent uses a three-level behavior hierarchy as described above. There is a single choice state, at the top level, and the agent learns to coordinate its choices by keeping a table of Q-values. The Q-value table is indexed by the current percepts and the chosen action (one of four abstract machines). The HAM agent uses a discount of 0.9 and a learning rate of 0.1. Exploration is done with a simple epsilon-greedy strategy. \n\nThe first pair of HSM agents use the same behavior hierarchy as the HAM agent. However, they use short-term memory at the most abstract level to learn a strategy for navigating the corridor. The first of these agents uses NSM at the top level with a history length of 1000, k = 4, a discount of 0.9, and a learning rate of 0.1. 
The second agent uses USM at the top level with a discount of 0.95. The performance of these top-level memory agents was studied as a control against the more complex multi-level memory agents described next. \n\nThe next pair of HSM agents use short-term memory both at the abstract navigation level and at the intermediate level. The behavior decomposition at the abstract navigation level is the same as for the previous agents; however, the traversal behavior is in turn composed of machines that must make a decision based on short-term memory. Each of the machines at the traversal level uses short-term memory to learn to coordinate a strategy of behaviors for traversing a corridor. The memory-based version of the traversal machine is shown in Figure 3c. The first of these agents uses NSM as the short-term memory technique at both levels of the hierarchy. It uses a history length of 1000, k = 4, a discount of 0.9, and a learning rate of 0.1. The second agent uses USM as the short-term memory technique at the top level with a discount of 0.95. At the intermediate level, it uses NSM with the same learning parameters as the preceding agent. Exploration is done with a simple epsilon-greedy strategy in all cases. \n\nFinally, we study the behavior of a \"flat\" NSM agent. The flat agent must keep track of the following perceptual data: first, it needs the same perceptual information as the top-level HAM (so it can identify the goal); second, it needs the additional perceptual data for aligning to walls and for avoiding obstacles: whether it was bumped, and the angle to the wall (binned into 4 groups of 45\u00b0 each). The flat agent chooses among four primitive actions: go-forward, veer-left, veer-right, and back-up. 
Not only must it learn to make it to the goal, it must simultaneously learn to align itself to walls and avoid obstacles. The NSM agent uses a history length of 1000, k = 4, a discount of 0.9, and a learning rate of 0.1. Exploration is done with a simple epsilon-greedy strategy. \n\n5 Experimental Results \n\nIn Figure 4, we see the learning performance of each agent in the navigation task. The graphs show the performance advantage of both multi-level HSM agents over the other agents. In particular, we find that the flat memory-based agent does considerably worse than the other three, as expected. The flat agent must carry around the perceptual data to perform both high- and low-level behaviors. From the point of view of navigation, this results in long strings of uninformative corridor states between the more informative intersection states. Since it takes such an agent longer to discover patterns in its experience, it never quite learns to navigate successfully to the goal. \n\nNext, both multi-level memory-based hierarchical agents outperform the HAM agent. The HAM agent does better at navigation than the flat agent since it abstracts away the perceptually aliased corridor states. However, it is unable to distinguish between all of the intersections. Without the ability to tell which T-junctions lead to the goal, and which to a dead end, the HAM agent does not perform as well. The multi-level HSM agents also outperform the single-level ones. The multi-level agents can tune their traversing strategy to the characteristics of the cluttered hallway by using short-term memory at the intermediate level. \n\nFinally, although it initially does worse, the multi-level HSM agent with USM soon outperforms the multi-level HSM agent with NSM. 
This is because the USM algorithm forces the agent to learn a state representation that uses only as much incoming history as needed to predict reward. That is, it tries to learn the right history suffix for each situation rather than approximating the suffix by simply matching greedily on incoming history. Learning such a representation takes some time, but, once learned, produces better performance. \n\n6 Conclusions and Future Work \n\nIn this paper we described a framework for solving large perceptually aliased tasks called Hierarchical Suffix Memory (HSM). This approach uses a hierarchical behavioral structure to index into past memory at multiple levels of resolution. Organizing past experience hierarchically scales better to problems with long decision sequences. We presented an experiment comparing six different learning methods, showing that hierarchical short-term memory produces overall the best performance in a perceptually aliased corridor navigation task. \n\n[Figure 4 plots: two panels with x-axis \"Number of Primitive Steps\" (0 to 40000) and y-axis ticks up to 0.0012. One panel compares multi-level memory (USM+HAM), multi-level memory (NSM+HAM), no memory (HAM), and flat memory (NSM); the other compares multi-level memory (USM+HAM), multi-level memory (NSM+HAM), top-level only memory (USM+HAM), and top-level only memory (NSM+HAM).] \n\nFigure 4: Learning performance in the navigation task. Each curve is averaged over eight trials for each agent. 
\n\nOne key limitation of the current HSM framework is that each abstraction level examines only the history at its own level. Allowing interaction between the memory streams at each level of the hierarchy would be beneficial. Consider a navigation task in which the decision at a given intersection depends on an observation seen while traversing the corridor. In this case, the abstract level should have the ability to \"zoom in\" to inspect a particular low-level experience in greater detail. We expect that the pursuit of general frameworks such as HSM to manage past experience at variable granularity will lead to strategies for control that are able to gracefully scale to large, partially observable problems. \n\nAcknowledgements \n\nThis research was carried out while the first author was at the Department of Computer Science and Engineering, Michigan State University. This research is supported in part by a KDI grant from the National Science Foundation ECS-9873531. \n\nReferences \n\n[1] Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Autonomous Robots Journal, Special Issue on Learning in Autonomous Robots, 1998. \n[2] Andrew K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995. \n[3] Ron Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD thesis, University of California at Berkeley, 1998. \n[4] Dana Ron, Yoram Singer, and Naftali Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25:117-149, 1996. \n[5] R. Sutton, D. Precup, and S. Singh. Intra-option learning about temporally abstract actions. 
In Proceedings of the 15th International Conference on Machine Learning, pages 556-564, 1998. \n\n", "award": [], "sourceid": 1837, "authors": [{"given_name": "Natalia", "family_name": "Hernandez-Gardiol", "institution": null}, {"given_name": "Sridhar", "family_name": "Mahadevan", "institution": null}]}