{"title": "On-line Policy Improvement using Monte-Carlo Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1068, "page_last": 1074, "abstract": null, "full_text": "On-line Policy Improvement using Monte-Carlo Search\n\nGerald Tesauro\nIBM T. J. Watson Research Center\nP. O. Box 704\nYorktown Heights, NY 10598\n\nGregory R. Galperin\nMIT AI Lab\n545 Technology Square\nCambridge, MA 02139\n\nAbstract\n\nWe present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers.\n\nWe have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
\n\n1  INTRODUCTION\n\nPolicy iteration, a widely used algorithm for solving problems in adaptive control, consists of repeatedly iterating the following policy improvement computation (Bertsekas, 1995): (1) First, a value function is computed that represents the long-term expected reward that would be obtained by following an initial policy. (This may be done in several ways, such as with the standard dynamic programming algorithm.) (2) An improved policy is then defined which is greedy with respect to that value function. Policy iteration is known to have rapid and robust convergence properties, and for Markov tasks with lookup-table state-space representations, it is guaranteed to converge to the optimal policy.\n\nIn typical uses of policy iteration, the policy improvement step is an extensive off-line procedure. For example, in dynamic programming, one performs a sweep through all states in the state space. Reinforcement learning provides another approach to policy improvement; recently, several authors have investigated using RL in conjunction with nonlinear function approximators to represent the value functions and/or policies (Tesauro, 1992; Crites and Barto, 1996; Zhang and Dietterich, 1996). These studies are based on following actual state-space trajectories rather than sweeps through the full state space, but are still too slow to compute improved policies in real time. Such function approximators typically need extensive off-line training on many trajectories before they achieve acceptable performance levels.\n\nIn contrast, we propose an on-line algorithm for computing an improved policy in real time.\n
We use Monte-Carlo search to estimate V_P(z, a), the expected value of performing action a in state z and subsequently executing policy P in all successor states. Here, P is some given arbitrary policy, as defined by a \"base controller\" (we do not care how P is defined or was derived; we only need access to its policy decisions). In the Monte-Carlo search, many simulated trajectories starting from (z, a) are generated following P, and the expected long-term reward is estimated by averaging the results from each of the trajectories. (Note that Monte-Carlo sampling is needed only for non-deterministic tasks, because in a deterministic task, only one trajectory starting from (z, a) would need to be examined.) Having estimated V_P(z, a), the improved policy P' at state z is defined to be the action which produced the best estimated value in the Monte-Carlo simulation, i.e., P'(z) = argmax_a V_P(z, a).\n\n1.1  EFFICIENT IMPLEMENTATION\n\nThe proposed Monte-Carlo algorithm could be very CPU-intensive, depending on the number of initial actions that need to be simulated, the number of time steps per trial needed to obtain a meaningful long-term reward, the amount of CPU per time step needed to make a decision with the base controller, and the total number of trials needed to make a Monte-Carlo decision. The last factor depends both on the variance in expected reward per trial and on how close the values of competing candidate actions are.\n\nWe propose two methods to address the potentially large CPU requirements of this approach. First, the power of parallelism can be exploited very effectively.\n
The algorithm is easily parallelized with high efficiency: the individual Monte-Carlo trials can be performed independently, and the combining of results from different trials is a simple averaging operation. Hence there is relatively little communication between processors required in a parallel implementation.\n\nThe second technique is to continually monitor the accumulated Monte-Carlo statistics during the simulation, and to prune away both candidate actions that are sufficiently unlikely (outside some user-specified confidence bound) to be selected as the best action, as well as candidates whose values are sufficiently close to the value of the current best estimate that they are considered equivalent (i.e., choosing either would not make a significant difference). This technique requires more communication in a parallel implementation, but offers potentially large savings in the number of trials needed to make a decision.\n\n2  APPLICATION TO BACKGAMMON\n\nWe have initially applied the Monte-Carlo algorithm to making move decisions in the game of backgammon. This is an absorbing Markov process with perfect state-space information, and one has a perfect model of the nondeterminism in the system, as well as the mapping from actions to resulting states.\n\nIn backgammon parlance, the expected value of a position is known as the \"equity\" of the position, and estimating the equity by Monte-Carlo sampling is known as performing a \"rollout.\" This involves playing the position out to completion many times with different random dice sequences, using a fixed policy P to make move decisions for both sides.\n
The sequences are terminated at the end of the game (when one side has borne off all 15 checkers), and at that time a signed outcome value (called \"points\") is recorded. The outcome value is positive if one side wins and negative if the other side wins, and the magnitude of the value can be either 1, 2, or 3, depending on whether the win was normal, a gammon, or a backgammon. With normal human play, games typically last on the order of 50-60 time steps. Hence if one is using the Monte-Carlo player to play out actual games, the Monte-Carlo trials will on average start out somewhere in the middle of a game, and take about 25-30 time steps to reach completion.\n\nIn backgammon there are on average about 20 legal moves to consider in a typical decision. The candidate plays frequently differ in expected value by on the order of 0.01. Thus in order to resolve the best play by Monte-Carlo sampling, one would need on the order of 10K or more trials per candidate, or a total of hundreds of thousands of Monte-Carlo trials to make one move decision. With extensive statistical pruning as discussed previously, this can be reduced to several tens of thousands of trials. Multiplying this by 25-30 decisions per trial with the base player, we find that about a million base-player decisions have to be made in order to make one Monte-Carlo decision. With typical human tournament players taking about 10 seconds per move, we need to parallelize to the point that we can achieve at least 100K base-player decisions per second.\n\nOur Monte-Carlo simulations were performed on the IBM SP1 and SP2 parallel-RISC supercomputers at IBM Watson and at Argonne National Laboratories. Each SP node is equivalent to a fast RS/6000, with floating-point capability on the order of 100 Mflops.\n
Typical runs were on configurations of 16-32 SP nodes, with parallel speedup efficiencies on the order of 90%.\n\nWe have used a variety of base players in our Monte-Carlo simulations, with widely varying playing abilities and CPU requirements. The weakest (and fastest) of these is a purely random player. We have also used a few single-layer networks (i.e., no hidden units) with simple encodings of the board state, that were trained by backpropagation on an expert data set (Tesauro, 1989). These simple networks also make fast move decisions, and are much stronger than a random player, but in human terms are only at a beginner-to-intermediate level. Finally, we used some multi-layer nets with a rich input representation, encoding both the raw board state and many hand-crafted features, trained on self-play using the TD(\u03bb) algorithm (Sutton, 1988; Tesauro, 1992). Such networks play at an advanced level, but are too slow to make Monte-Carlo decisions in real time based on full rollouts to completion. Results for all these players are presented in the following two sections.\n\n2.1  RESULTS FOR SINGLE-LAYER NETWORKS\n\nWe measured the game-playing strength of three single-layer base players, and of the corresponding Monte-Carlo players, by playing several thousand games against a common benchmark opponent. The benchmark opponent was TD-Gammon 2.1 (Tesauro, 1995), playing on its most basic playing level (1-ply search, i.e., no lookahead). Table 1 shows the results.\n
Lin-1 is a single-layer neural net with only the raw board description (number of White and Black checkers at each location) as input. Lin-2 uses the same network structure and weights as Lin-1, but with a significant amount of random noise added to the evaluation function in order to deliberately weaken its playing ability. These networks were highly optimized for speed, and are capable of making a move decision in about 0.2 msec on a single SP1 node. Lin-3 uses the same raw board input as the other two players, plus a few additional hand-crafted features related to the probability of a checker being hit; there is no noise added. This network is a significantly stronger player, but is about twice as slow in making move decisions.\n\nNetwork  Base player  Monte-Carlo player  Monte-Carlo CPU\nLin-1    -0.52 ppg    -0.01 ppg           5 sec/move\nLin-2    -0.65 ppg    -0.02 ppg           5 sec/move\nLin-3    -0.32 ppg    +0.04 ppg           10 sec/move\n\nTable 1: Performance of three simple linear evaluators, for both initial base players and corresponding Monte-Carlo players. Performance is measured in terms of expected points per game (ppg) vs. TD-Gammon 2.1 1-ply. Positive numbers indicate that the player here is better than TD-Gammon. Base player stats are the results of 30K trials (std. dev. about 0.005), and Monte-Carlo stats are the results of 5K trials (std. dev. about 0.02). CPU times are for the Monte-Carlo player running on 32 SP1 nodes.\n\nWe can see in Table 1 that the Monte-Carlo technique produces dramatic improvement in playing ability for these weak initial players.\n
As base players, Lin-1 should be regarded as a bad intermediate player, while Lin-2 is substantially worse and is probably about equal to a human beginner. Both of these networks get trounced by TD-Gammon, which on its 1-ply level plays at a strong advanced level. Yet the resulting Monte-Carlo players from these networks appear to play about equal to TD-Gammon 1-ply. Lin-3 is a significantly stronger player, and the resulting Monte-Carlo player appears to be clearly better than TD-Gammon 1-ply. It is estimated to be about equivalent to TD-Gammon on its 2-ply level, which plays at a strong expert level.\n\nThe Monte-Carlo benchmarks reported in Table 1 involved substantial amounts of CPU time. At 10 seconds per move decision, and 25 move decisions per game, playing 5000 games against TD-Gammon required about 350 hours of 32-node SP machine time. We have also developed an alternative testing procedure, which is much less expensive in CPU time, but still seems to give a reasonably accurate measure of performance strength. We measure the average equity loss of the Monte-Carlo player on a suite of test positions. We have a collection of about 800 test positions, in which every legal play has been extensively rolled out by TD-Gammon 2.1 1-ply. We then use the TD-Gammon rollout data to grade the quality of a given player's move decisions.\n\nTest set results for the three linear evaluators, and for a random evaluator, are displayed in Table 2. It is interesting to note for comparison that the TD-Gammon 1-ply base player scores 0.0120 on this test set measure, comparable to the Lin-1 Monte-Carlo player, while the TD-Gammon 2-ply base player scores 0.00843, comparable to the Lin-3 Monte-Carlo player.\n
These results are exactly in line with what we measured in Table 1 using full-game benchmarking, and thus indicate that the test-set methodology is in fact reasonably accurate. We also note that in each case, there is a huge error reduction of potentially a factor of 4 or more in using the Monte-Carlo technique. In fact, the rollouts summarized in Table 2 were done using fairly aggressive statistical pruning; we expect that rolling out decisions more extensively would give error reduction ratios closer to a factor of 5, albeit at a cost of increased CPU time.\n\nEvaluator  Base loss  Monte-Carlo loss  Ratio\nRandom     0.330      0.131             2.5\nLin-1      0.040      0.0124            3.2\nLin-2      0.0665     0.0175            3.8\nLin-3      0.0291     0.00749           3.9\n\nTable 2: Average equity loss per move decision on an 800-position test set, for both initial base players and corresponding Monte-Carlo players. Units are ppg; smaller loss values are better. Also computed is the ratio of base player loss to Monte-Carlo loss.\n\n2.2  RESULTS FOR MULTI-LAYER NETWORKS\n\nUsing large multi-layer networks to do full rollouts is not feasible for real-time move decisions, since the large networks are at least a factor of 100 slower than the linear evaluators described previously. We have therefore investigated an alternative Monte-Carlo algorithm, using so-called \"truncated rollouts.\" In this technique trials are not played out to completion, but instead only a few steps in the simulation are taken, and the neural net's equity estimate of the final position reached is used instead of the actual outcome. The truncated rollout algorithm requires much less CPU time, due to two factors: First, there are potentially many fewer steps per trial.\n
Second, there is much less variance per trial, since only a few random steps are taken and a real-valued estimate is recorded, rather than many random steps and an integer final outcome. These two factors combine to give at least an order of magnitude speed-up compared to full rollouts, while still giving a large error reduction relative to the base player.\n\nTable 3 shows truncated rollout results for two multi-layer networks: TD-Gammon 2.1 1-ply, which has 80 hidden units, and a substantially smaller network with the same input features but only 10 hidden units. The first line of data for each network reflects very extensive rollouts and shows quite large error reduction ratios, although the CPU times are somewhat longer than acceptable for real-time play. (Also we should be somewhat suspicious of the 80 hidden unit result, since this was the same network that generated the data being used to grade the Monte-Carlo players.) The second line of data shows what happens when the rollout trials are cut off more aggressively. This yields significantly faster run-times, at the price of only slightly worse move decisions.\n\nThe quality of play of the truncated rollout players shown in Table 3 is substantially better than TD-Gammon 1-ply or 2-ply, and it is also substantially better than the full-rollout Monte-Carlo players described in the previous section. In fact, we estimate that the world's best human players would score in the range of 0.005 to 0.006 on this test set, so the truncated rollout players may actually be exhibiting superhuman playing ability, in reasonable amounts of SP machine time.\n\n3  DISCUSSION\n\nOn-line search may provide a useful methodology for overcoming some of the limitations of training nonlinear function approximators on difficult control tasks.\n
The idea of using search to improve in real time the performance of a heuristic controller is an old one, going back at least to (Shannon, 1950). Full-width search algorithms have been extensively studied since the time of Shannon, and have produced tremendous success in computer games such as chess, checkers and Othello. Their main drawback is that the required CPU time increases exponentially with the depth of the search, i.e., T ~ B^D, where B is the effective branching factor and D is the search depth. In contrast, Monte-Carlo search provides a tractable alternative for doing very deep searches, since the CPU time for a full Monte-Carlo decision only scales as T ~ N \u00b7 B \u00b7 D, where N is the number of trials in the simulation.\n\nHidden units  Base loss  Truncated Monte-Carlo loss     Ratio  M-C CPU\n10            0.0152     0.00318 (11-step, thorough)    4.8    25 sec/move\n10            0.0152     0.00433 (11-step, optimistic)  3.5    9 sec/move\n80            0.0120     0.00181 (7-step, thorough)     6.6    65 sec/move\n80            0.0120     0.00269 (7-step, optimistic)   4.5    18 sec/move\n\nTable 3: Truncated rollout results for two multi-layer networks, with number of hidden units and rollout steps as indicated. Average equity loss per move decision on an 800-position test set, for both initial base players and corresponding Monte-Carlo players. Again, units are ppg, and smaller loss values are better. Also computed is the ratio of base player loss to Monte-Carlo loss. CPU times are for the Monte-Carlo player running on 32 SP1 nodes.\n\nIn the backgammon application, for a wide range of initial policies, our on-line Monte-Carlo algorithm, which basically implements a single step of policy iteration, was found to give very substantial error reductions.\n
Potentially 80% or more of the base player's equity loss can be eliminated, depending on how extensive the Monte-Carlo trials are. The magnitude of the observed improvement is surprising to us: while it is known theoretically that each step of policy iteration produces a strict improvement, there are no guarantees on how much improvement one can expect. We have also noted a rough trend in the data: as one increases the strength of the base player, the ratio of error reduction due to the Monte-Carlo technique appears to increase. This could reflect superlinear convergence properties of policy iteration.\n\nIn cases where the base player employs an evaluator that is able to estimate expected outcome, the truncated rollout algorithm appears to offer favorable tradeoffs relative to doing full rollouts to completion. While the quality of Monte-Carlo decisions is not as good using truncated rollouts (presumably because the neural net's estimates are biased), the degradation in quality is fairly small in at least some cases, and is compensated by a great reduction in CPU time. This allows more sophisticated (and thus slower) base players to be used, resulting in decisions which appear to be both better and faster.\n\nThe Monte-Carlo backgammon program as implemented on the SP offers the potential to achieve real-time move decision performance that exceeds human capabilities. In future work, we plan to augment the program with a similar Monte-Carlo algorithm for making doubling decisions. It is quite possible that such a program would be by far the world's best backgammon player.\n\nBeyond the backgammon application, we conjecture that on-line Monte-Carlo search may prove to be useful in many other applications of reinforcement learning and adaptive control.\n
The main requirement is that it should be possible to simulate the environment in which the controller operates. Since basically all of the recent successful applications of reinforcement learning have been based on training in simulators, this doesn't seem to be an undue burden. Thus, for example, Monte-Carlo search may well improve decision-making in the domains of elevator dispatch (Crites and Barto, 1996) and job-shop scheduling (Zhang and Dietterich, 1996).\n\nWe are additionally investigating two techniques for training a controller based on the Monte-Carlo estimates. First, one could train each candidate position on its computed rollout equity, yielding a procedure similar in spirit to TD(1). We expect this to converge to the same policy as other TD(\u03bb) approaches, perhaps more efficiently due to the decreased variance in the target values as well as the easily parallelizable nature of the algorithm. Alternatively, the base position - the initial position from which the candidate moves are being made - could be trained with the best equity value from among all the candidates (corresponding to the move chosen by the rollout player). In contrast, TD(\u03bb) effectively trains the base position with the equity of the move chosen by the base controller. Because the improved choice of move achieved by the rollout player yields an expectation closer to the true (optimal) value, we expect the learned policy to differ from, and possibly be closer to optimal than, the original policy.\n\nAcknowledgments\n\nWe thank Argonne National Laboratories for providing SP1 machine time used to perform some of the experiments reported here. Gregory Galperin acknowledges support under Navy-ONR grant N00014-96-1-0311.
\n\nReferences\n\nD. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA (1995).\n\nR. H. Crites and A. G. Barto, \"Improving elevator performance using reinforcement learning.\" In: D. Touretzky et al., eds., Advances in Neural Information Processing Systems 8, 1017-1023, MIT Press (1996).\n\nC. E. Shannon, \"Programming a computer for playing chess.\" Philosophical Magazine 41, 265-275 (1950).\n\nR. S. Sutton, \"Learning to predict by the methods of temporal differences.\" Machine Learning 3, 9-44 (1988).\n\nG. Tesauro, \"Connectionist learning of expert preferences by comparison training.\" In: D. Touretzky, ed., Advances in Neural Information Processing Systems 1, 99-106, Morgan Kaufmann (1989).\n\nG. Tesauro, \"Practical issues in temporal difference learning.\" Machine Learning 8, 257-277 (1992).\n\nG. Tesauro, \"Temporal difference learning and TD-Gammon.\" Comm. of the ACM, 38:3, 58-67 (1995).\n\nW. Zhang and T. G. Dietterich, \"High-performance job-shop scheduling with a time-delay TD(\u03bb) network.\" In: D. Touretzky et al., eds., Advances in Neural Information Processing Systems 8, 1024-1030, MIT Press (1996).\n", "award": [], "sourceid": 1302, "authors": [{"given_name": "Gerald", "family_name": "Tesauro", "institution": null}, {"given_name": "Gregory", "family_name": "Galperin", "institution": null}]}