{"title": "Improving Elevator Performance Using Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1023, "abstract": null, "full_text": "Improving Elevator Performance Using Reinforcement Learning\n\nRobert H. Crites\n\nComputer Science Department\nUniversity of Massachusetts\nAmherst, MA 01003-4610\ncrites@cs.umass.edu\n\nAndrew G. Barto\n\nComputer Science Department\nUniversity of Massachusetts\nAmherst, MA 01003-4610\nbarto@cs.umass.edu\n\nAbstract\n\nThis paper describes the application of reinforcement learning (RL)\nto the difficult real-world problem of elevator dispatching. The elevator\ndomain poses a combination of challenges not seen in most\nRL research to date. Elevator systems operate in continuous state\nspaces and in continuous time as discrete event dynamic systems.\nTheir states are not fully observable and they are nonstationary\ndue to changing passenger arrival rates. In addition, we use a team\nof RL agents, each of which is responsible for controlling one elevator\ncar. The team receives a global reinforcement signal which\nappears noisy to each agent due to the effects of the actions of the\nother agents, the random nature of the arrivals and the incomplete\nobservation of the state. In spite of these complications, we show\nresults that in simulation surpass the best of the heuristic elevator\ncontrol algorithms of which we are aware. These results demonstrate\nthe power of RL on a very large scale stochastic dynamic\noptimization problem of practical utility.\n\n1 INTRODUCTION\n\nRecent algorithmic and theoretical advances in reinforcement learning (RL) have\nattracted widespread interest. RL algorithms have appeared that approximate dynamic\nprogramming (DP) on an incremental basis. 
Unlike traditional DP algorithms,\nthese algorithms can perform with or without models of the system, and\nthey can be used online as well as offline, focusing computation on areas of state\nspace that are likely to be visited during actual control. On very large problems,\nthey can provide computationally tractable ways of approximating DP. An example\nof this is Tesauro's TD-Gammon system (Tesauro, 1992; 1994; 1995), which\nused RL techniques to learn to play strong master-level backgammon. Even the\nbest human experts make poor teachers for this class of problems since they do not\nalways know the best actions. Even if they did, the state space is so large that\nit would be difficult for experts to provide sufficient training data. RL algorithms\nare naturally suited to this class of problems, since they learn on the basis of their\nown experience. This paper describes the application of RL to elevator dispatching,\nanother problem where classical DP is completely intractable. The elevator domain\nposes a number of difficulties that were not present in backgammon. In spite of\nthese complications, we show results that surpass the best of the heuristic elevator\ncontrol algorithms of which we are aware. The following sections describe the elevator\ndispatching domain, the RL algorithm and neural network architectures that\nwere used, the results, and some conclusions.\n\n2 THE ELEVATOR SYSTEM\n\nThe particular elevator system we examine is a simulated 10-story building with\n4 elevator cars (Lewis, 1991; Bao et al., 1994). Passenger arrivals at each floor are\nassumed to be Poisson, with arrival rates that vary during the course of the day. 
\nOur simulations use a traffic profile (Bao et al., 1994) which dictates arrival rates for\nevery 5-minute interval during a typical afternoon down-peak rush hour. Table 1\nshows the mean number of passengers arriving at each floor (2-10) during each\n5-minute interval who are headed for the lobby. In addition, there is inter-floor\ntraffic which varies from 0% to 10% of the traffic to the lobby.\n\nTable 1: The Down-Peak Traffic Profile\n\nThe system dynamics are approximated by the following parameters:\n\n\u2022 Floor time (the time to move one floor at the maximum speed): 1.45 secs.\n\u2022 Stop time (the time needed to decelerate, open and close the doors, and\naccelerate again): 7.19 secs.\n\u2022 Turn time (the time needed for a stopped car to change direction): 1 sec.\n\u2022 Load time (the time for one passenger to enter or exit a car): random\nvariable from a 20th order truncated Erlang distribution with a range from\n0.6 to 6.0 secs and a mean of 1 sec.\n\u2022 Car capacity: 20 passengers.\n\nThe state space is continuous because it includes the elapsed times since any hall\ncalls were registered. Even if these real values are approximated as binary values,\nthe size of the state space is still immense. Its components include $2^{18}$ possible\ncombinations of the 18 hall call buttons (up and down buttons at each landing\nexcept the top and bottom), $2^{40}$ possible combinations of the 40 car buttons, and\n$18^{4}$ possible combinations of the positions and directions of the cars (rounding off\nto the nearest floor). Other parts of the state are not fully observable, for example,\nthe desired destinations of the passengers waiting at each floor. 
Ignoring everything\nexcept the configuration of the hall and car call buttons and the approximate position\nand direction of the cars, we obtain an extremely conservative estimate of the\nsize of a discrete approximation to the continuous state space:\n\n$$2^{18} \\cdot 2^{40} \\cdot 18^{4} \\approx 10^{22} \\text{ states}$$\n\nEach car has a small set of primitive actions. If it is stopped at a floor, it must either\n\"move up\" or \"move down\". If it is in motion between floors, it must either \"stop\nat the next floor\" or \"continue past the next floor\". Due to passenger expectations,\nthere are two constraints on these actions: a car cannot pass a floor if a passenger\nwants to get off there and cannot turn until it has serviced all the car buttons in its\npresent direction. We have added three additional action constraints in an attempt\nto build in some primitive prior knowledge: a car cannot stop at a floor unless\nsomeone wants to get on or off there, it cannot stop to pick up passengers at a floor\nif another car is already stopped there, and given a choice between moving up and\ndown, it should prefer to move up (since the down-peak traffic tends to push the\ncars toward the bottom of the building). Because of this last constraint, the only\nreal choices left to each car are the stop and continue actions. The actions of the\nelevator cars are executed asynchronously since they may take different amounts of\ntime to complete.\n\nThe performance objectives of an elevator system can be defined in many ways. One\npossible objective is to minimize the average wait time, which is the time between\nthe arrival of a passenger and his entry into a car. Another possible objective is\nto minimize the average system time, which is the sum of the wait time and the\ntravel time. 
A third possible objective is to minimize the percentage of passengers\nthat wait longer than some dissatisfaction threshold (usually 60 seconds). Another\ncommon objective is to minimize the sum of squared wait times. We chose this\nlatter performance objective since it tends to keep the wait times low while also\nencouraging fair service.\n\n3 THE ALGORITHM AND NETWORK ARCHITECTURE\n\nElevator systems can be modeled as discrete event systems, where significant events\n(such as passenger arrivals) occur at discrete times, but the amount of time between\nevents is a real-valued variable. In such systems, the constant discount factor $\\gamma$\nused in most discrete-time reinforcement learning algorithms is inadequate. This\nproblem can be approached using a variable discount factor that depends on the\namount of time between events (Bradtke & Duff, 1995). In this case, returns are\ndefined as integrals rather than as infinite sums, as follows:\n\n$$\\sum_{t=0}^{\\infty} \\gamma^{t} r_t$$\n\nbecomes\n\n$$\\int_{0}^{\\infty} e^{-\\beta \\tau} r_\\tau \\, d\\tau$$\n\nwhere $r_t$ is the immediate cost at discrete time $t$, $r_\\tau$ is the instantaneous cost at\ncontinuous time $\\tau$ (e.g., the sum of the squared wait times of all waiting passengers),\nand $\\beta$ controls the rate of exponential decay.\nCalculating reinforcements here poses a problem in that it seems to require knowledge\nof the waiting times of all waiting passengers. There are two ways of dealing\nwith this problem. The simulator knows how long each passenger has been waiting.\nIt could use this information to determine what could be called omniscient reinforcements.\nThe other possibility is to use only information that would be available\nto a real system online. Such online reinforcements assume only that the waiting\ntime of the first passenger in each queue is known (which is the elapsed button\ntime). 
If the Poisson arrival rate $\\lambda$ for each queue is estimated as the reciprocal of\nthe last inter-button time for that queue, the Gamma distribution can be used to\nestimate the arrival times of subsequent passengers. The time until the $n$th subsequent\narrival follows the Gamma distribution $\\Gamma(n, 1/\\lambda)$. For each queue, subsequent\narrivals will generate the following expected penalties during the first $b$ seconds after\nthe hall button has been pressed:\n\n$$\\sum_{n=1}^{\\infty} \\int_{0}^{b} (\\text{prob. } n\\text{th arrival occurs at time } \\tau) \\cdot (\\text{penalty given arrival at time } \\tau) \\, d\\tau$$\n\nThis integral can be solved by parts to yield expected penalties. We found that\nusing online reinforcements actually produced somewhat better results than using\nomniscient reinforcements, presumably because the algorithm was trying to learn\naverage values anyway.\n\nBecause elevator system events occur randomly in continuous time, the branching\nfactor is effectively infinite, which complicates the use of algorithms that require\nexplicit lookahead. Therefore, we employed a team of discrete-event Q-learning\nagents, where each agent is responsible for controlling one elevator car. $Q(x, a)$\nis defined as the expected infinite discounted return obtained by taking action $a$\nin state $x$ and then following an optimal policy (Watkins, 1989). Because of the\nvast number of states, the Q-values are stored in feedforward neural networks. The\nnetworks receive some state information as input, and produce Q-value estimates\nas output. We have tested two architectures. In the parallel architecture, the agents\nshare a single network, allowing them to learn from each other's experiences and\nforcing them to learn identical policies. 
In the fully decentralized architecture, the\nagents have their own networks, allowing them to specialize their control policies.\nIn either case, none of the agents have explicit access to the actions of the other\nagents. Cooperation has to be learned indirectly via the global reinforcement signal.\nEach agent faces added stochasticity and nonstationarity because its environment\ncontains other learning agents. Other work on team Q-learning is described in\n(Markey, 1994).\nThe algorithm calls for each car to select its actions probabilistically using the\nBoltzmann distribution over its Q-value estimates, where the temperature is lowered\ngradually during training. After every decision, error backpropagation is used\nto train the car's estimate of $Q(x, a)$ toward the following target output:\n\n$$\\int_{t_x}^{t_y} e^{-\\beta(\\tau - t_x)} r_\\tau \\, d\\tau + e^{-\\beta(t_y - t_x)} \\min_{b} \\hat{Q}(y, b)$$\n\nwhere action $a$ is taken by the car from state $x$ at time $t_x$, the next decision by\nthat car is required from state $y$ at time $t_y$, and $r_\\tau$ and $\\beta$ are defined as above.\n$e^{-\\beta(t_y - t_x)}$ acts as a variable discount factor that depends on the amount of time\nbetween events. The learning rate parameter was set to 0.01 or 0.001 and $\\beta$ was set\nto 0.01 in the experiments described in this paper.\n\nAfter considerable experimentation, our best results were obtained using networks\nfor pure down traffic with 47 input units, 20 hidden sigmoid units, and two linear\noutput units (one for each action value). The input units are as follows:\n\n\u2022 18 units: Two units encode information about each of the nine down hall\nbuttons. A real-valued unit encodes the elapsed time if the button has\nbeen pushed and a binary unit is on if the button has not been pushed. 
\n\n\u2022 16 units: Each of these units represents a possible location and direction\nfor the car whose decision is required. Exactly one of these units will be on\nat any given time.\n\n\u2022 10 units: These units each represent one of the 10 floors where the other cars\nmay be located. Each car has a \"footprint\" that depends on its direction\nand speed. For example, a stopped car causes activation only on the unit\ncorresponding to its current floor, but a moving car causes activation on\nseveral units corresponding to the floors it is approaching, with the highest\nactivations on the closest floors.\n\n\u2022 1 unit: This unit is on if the car whose decision is required is at the highest\nfloor with a waiting passenger.\n\n\u2022 1 unit: This unit is on if the car whose decision is required is at the floor\nwith the passenger that has been waiting for the longest amount of time.\n\n\u2022 1 unit: The bias unit is always on.\n\n4 RESULTS\n\nSince an optimal policy for the elevator dispatching problem is unknown, we measured\nthe performance of our algorithm against other heuristic algorithms, including\nthe best of which we were aware. 
The algorithms were: SECTOR, a sector-based\nalgorithm similar to what is used in many actual elevator systems; DLB, Dynamic\nLoad Balancing, which attempts to equalize the load of all cars; HUFF, Highest\nUnanswered Floor First, which gives priority to the highest floor with people waiting; LQF,\nLongest Queue First, which gives priority to the queue with the person who has been\nwaiting for the longest amount of time; FIM, Finite Intervisit Minimization, a receding\nhorizon controller that searches the space of admissible car assignments to\nminimize a load function; and ESA, Empty the System Algorithm, a receding horizon\ncontroller that searches for the fastest way to \"empty the system\" assuming no new\npassenger arrivals. ESA uses queue length information that would not be available\nin a real elevator system. ESA/nq is a version of ESA that uses arrival rate information\nto estimate the queue lengths. For more details, see (Bao et al., 1994). These\nreceding horizon controllers are very sophisticated, but also very computationally\nintensive, such that they would be difficult to implement in real time. RLp and\nRLd denote the RL controllers, parallel and decentralized. The RL controllers were\neach trained on 60,000 hours of simulated elevator time, which took four days on a\n100 MIPS workstation. The results are averaged over 30 hours of simulated elevator\ntime. Table 2 shows the results for the traffic profile with down traffic only. 
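The four columns reported in Tables 2 through 4 follow from the performance objectives defined in Section 2. The sketch below is only an illustration of how such summaries could be computed from per-passenger logs. The PassengerRecord type and the summarize function are hypothetical names that do not come from the simulator, and the code assumes that the SquaredWait column is the mean of the squared wait times and that all times are in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PassengerRecord:
    # Hypothetical per-passenger log entry; times are in seconds.
    wait_time: float    # from arrival at the hall until entry into a car
    travel_time: float  # from entry into a car until exit at the destination


def summarize(records: List[PassengerRecord], threshold: float = 60.0) -> dict:
    # Compute the four table columns from a list of passenger records.
    n = len(records)
    if n == 0:
        raise ValueError('need at least one passenger record')
    waits = [r.wait_time for r in records]
    return {
        'avg_wait': sum(waits) / n,
        # Assumption: SquaredWait is the mean of the squared wait times.
        'squared_wait': sum(w * w for w in waits) / n,
        # System time is the wait time plus the travel time.
        'system_time': sum(r.wait_time + r.travel_time for r in records) / n,
        # Share of passengers who waited longer than the threshold.
        'percent_over_threshold': 100.0 * sum(w > threshold for w in waits) / n,
    }
```

Because the squared-wait term grows quickly with long waits, a controller that minimizes it is pushed to avoid leaving any single passenger waiting, which is the fairness argument given in Section 2.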
\n\nAlgorithm | AvgWait | SquaredWait | SystemTime | Percent>60 secs\nSECTOR | 21.4 | 674 | 47.7 | 1.12\nDLB | 19.4 | 658 | 53.2 | 2.74\nBASIC HUFF | 19.9 | 580 | 47.2 | 0.76\nLQF | 19.1 | 534 | 46.6 | 0.89\nHUFF | 16.8 | 396 | 48.6 | 0.16\nFIM | 16.0 | 359 | 47.9 | 0.11\nESA/nq | 15.8 | 358 | 47.7 | 0.12\nESA | 15.1 | 338 | 47.1 | 0.25\nRLp | 14.8 | 320 | 41.8 | 0.09\nRLd | 14.7 | 313 | 41.7 | 0.07\n\nTable 2: Results for Down-Peak Profile with Down Traffic Only\n\nTable 3 shows the results for the down-peak traffic profile with up and down traffic,\nincluding an average of 2 up passengers per minute at the lobby. The algorithm\nwas trained on down-only traffic, yet it generalizes well when up traffic is added\nand upward moving cars are forced to stop for any upward hall calls.\n\nAlgorithm | AvgWait | SquaredWait | SystemTime | Percent>60 secs\nSECTOR | 27.3 | 1252 | 54.8 | 9.24\nDLB | 21.7 | 826 | 54.4 | 4.74\nBASIC HUFF | 22.0 | 756 | 51.1 | 3.46\nLQF | 21.9 | 732 | 50.7 | 2.87\nHUFF | 19.6 | 608 | 50.5 | 1.99\nESA | 18.0 | 524 | 50.0 | 1.56\nFIM | 17.9 | 476 | 48.9 | 0.50\nRLp | 16.9 | 476 | 42.7 | 1.53\nRLd | 16.9 | 468 | 42.7 | 1.40\n\nTable 3: Results for Down-Peak Profile with Up and Down Traffic\n\nTable 4 shows the results for the down-peak traffic profile with up and down traffic,\nincluding an average of 4 up passengers per minute at the lobby. This time there is\ntwice as much up traffic, and the RL agents generalize extremely well to this new\nsituation. 
\n\nAlgorithm | AvgWait | SquaredWait | SystemTime | Percent>60 secs\nSECTOR | 30.3 | 1643 | 59.5 | 13.50\nHUFF | 22.8 | 884 | 55.3 | 5.10\nDLB | 22.6 | 880 | 55.8 | 5.18\nLQF | 23.5 | 877 | 53.5 | 4.92\nBASIC HUFF | 23.2 | 875 | 54.7 | 4.94\nFIM | 20.8 | 685 | 53.4 | 3.10\nESA | 20.1 | 667 | 52.3 | 3.12\nRLd | 18.8 | 593 | 45.4 | 2.40\nRLp | 18.6 | 585 | 45.7 | 2.49\n\nTable 4: Results for Down-Peak Profile with Twice as Much Up Traffic\n\nOne can see that both the RL systems achieved very good performance, most notably\nas measured by system time (the sum of the wait and travel time), a measure\nthat was not directly being minimized. Surprisingly, the decentralized RL system\nwas able to achieve as good a level of performance as the parallel RL system. Better\nperformance with nonstationary traffic profiles may be obtainable by providing\nthe agents with information about the current traffic context as part of their input\nrepresentation. We expect that an additional advantage of RL over heuristic controllers\nmay be in buildings with less homogeneous arrival rates at each floor, where\nRL can adapt to idiosyncrasies in their individual traffic patterns.\n\n5 CONCLUSIONS\n\nThese results demonstrate the utility of RL on a very large scale dynamic optimization\nproblem. By focusing computation onto the states visited during simulated\ntrajectories, RL avoids the need of conventional DP algorithms to exhaustively\nsweep the state set. By storing information in artificial neural networks, it avoids\nthe need to maintain large lookup tables. To achieve the above results, each RL\nsystem experienced 60,000 hours of simulated elevator time, which took four days\nof computer time on a 100 MIPS processor. 
Although this is a considerable amount\nof computation, it is negligible compared to what any conventional DP algorithm\nwould require. The results also suggest that approaches to decentralized control\nusing RL have considerable promise. Future research on the elevator dispatching\nproblem will investigate other traffic profiles and further explore the parallel and\ndecentralized RL architectures.\n\nAcknowledgements\n\nWe thank John McNulty, Christos Cassandras, Asif Gandhi, Dave Pepyne, Kevin\nMarkey, Victor Lesser, Rod Grupen, Rich Sutton, Steve Bradtke, and the ANW\ngroup for assistance with the simulator and for helpful discussions. This research\nwas supported by the Air Force Office of Scientific Research under grant F49620-93-1-0269.\n\nReferences\n\nG. Bao, C. G. Cassandras, T. E. Djaferis, A. D. Gandhi, and D. P. Looze. (1994)\nElevator Dispatchers for Down Peak Traffic. Technical Report, ECE Department,\nUniversity of Massachusetts, Amherst, MA.\n\nS. J. Bradtke and M. O. Duff. (1995) Reinforcement Learning Methods for\nContinuous-Time Markov Decision Problems. In: G. Tesauro, D. S. Touretzky\nand T. K. Leen, eds., Advances in Neural Information Processing Systems 7, MIT\nPress, Cambridge, MA.\n\nJ. Lewis. (1991) A Dynamic Load Balancing Approach to the Control of Multiserver\nPolling Systems with Applications to Elevator System Dispatching. PhD thesis,\nUniversity of Massachusetts, Amherst, MA.\n\nK. L. Markey. (1994) Efficient Learning of Multiple Degree-of-Freedom Control\nProblems with Quasi-independent Q-agents. In: M. C. Mozer, P. Smolensky,\nD. S. Touretzky, J. L. Elman and A. S. Weigend, eds., Proceedings of the 1993\nConnectionist Models Summer School. Erlbaum Associates, Hillsdale, NJ.\n\nG. Tesauro. 
(1992) Practical Issues in Temporal Difference Learning. Machine\nLearning 8:257-277.\n\nG. Tesauro. (1994) TD-Gammon, a Self-Teaching Backgammon Program, Achieves\nMaster-Level Play. Neural Computation 6:215-219.\n\nG. Tesauro. (1995) Temporal Difference Learning and TD-Gammon. Communications\nof the ACM 38:58-68.\n\nC. J. C. H. Watkins. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge\nUniversity.\n", "award": [], "sourceid": 1073, "authors": [{"given_name": "Robert", "family_name": "Crites", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}