{"title": "Beating a Defender in Robotic Soccer: Memory-Based Learning of a Continuous Function", "book": "Advances in Neural Information Processing Systems", "page_first": 896, "page_last": 902, "abstract": null, "full_text": "Beating a Defender in Robotic Soccer: \n\nMemory-Based Learning of a Continuous \n\nFUnction \n\nPeter Stone \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nManuela Veloso \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nLearning how to adjust to an opponent's position is critical to \nthe success of having intelligent agents collaborating towards the \nachievement of specific tasks in unfriendly environments. This pa(cid:173)\nper describes our work on a Memory-based technique for to choose \nan action based on a continuous-valued state attribute indicating \nthe position of an opponent. We investigate the question of how an \nagent performs in nondeterministic variations of the training situ(cid:173)\nations. Our experiments indicate that when the random variations \nfall within some bound of the initial training, the agent performs \nbetter with some initial training rather than from a tabula-rasa. \n\n1 \n\nIntroduction \n\nOne of the ultimate goals subjacent to the development of intelligent agents is to \nhave multiple agents collaborating in the achievement of tasks in the presence of \nhostile opponents. Our research works towards this broad goal from a Machine \nLearning perspective. We are particularly interested in investigating how an intel(cid:173)\nligent agent can choose an action in an adversarial environment. We assume that \nthe agent has a specific goal to achieve. We conduct this investigation in a frame(cid:173)\nwork where teams of agents compete in a game of robotic soccer. The real system \nof model cars remotely controlled from off-board computers is under development . \nOur research is currently conducted in a simulator of the physical system. \n\nBoth the simulator and the real-world system are based closely on systems de(cid:173)\nsigned by the Laboratory for ComputationalIntelligence at the University of British \nColumbia [Sahota et a/., 1995, Sahota, 1993]. The simulator facilitates the control \nof any number of cars and a ball within a designated playing area. Care has been \ntaken to ensure that the simulator models real-world responses (friction, conserva-\n\n\fMemory-based Learning of a Continuous Function \n\n897 \n\ntion of momentum, etc.) as closely as possible. Figure l(a) shows the simulator \ngraphics. \n\n-\n\nIJ \n\nI \n\n- <0 \n\n0 \n\n~ \n\nIJ \n\n(j -\n\n~ \n\n<:P \n\n(a) \n\n(b) \n\nFigure 1: (a) the graphic view of our simulator. (b) The initial position for all \nof the experiments in this paper. The teammate (black) remains stationary, the \ndefender (white) moves in a small circle at different speeds, and the ball can move \neither directly towards the goal or towards the teammate. The position of the ball \nrepresents the position of the learning agent. \n\nWe focus on the question of learning to choose among actions in the presence of \nan adversary. This paper describes our work on applying memory-based supervised \nlearning to acquire strategy knowledge that enables an agent to decide how to \nachieve a goal. For other work in the same domain, please see [Stone and Veloso , \n1995b]. 
For an extended discussion of other work on incremental and memory-based learning [Aha and Salzberg, 1994, Kanazawa, 1994, Kuh et al., 1991, Moore, 1991, Salganicoff, 1993, Schlimmer and Granger, 1986, Sutton and Whitehead, 1993, Wettschereck and Dietterich, 1994, Winstead and Christiansen, 1994], particularly as it relates to this paper, please see [Stone and Veloso, 1995a].

The input to our learning task includes a continuous-valued range of the position of the adversary. This raises the question of how to discretize the space of values into a set of learned features. Due to the cost of learning and reusing a large set of specialized instances, we notice a clear advantage to having an appropriate degree of generalization. For more details please see [Stone and Veloso, 1995a].

Here, we address the issue of the effect of differences between past episodes and the current situation. We performed extensive experiments, training the system under particular conditions and then testing it (with learning continuing incrementally) in nondeterministic variations of the training situation. Our results show that when the random variations fall within some bound of the initial training, the agent performs better with some initial training than from a tabula rasa. This intuitive fact is interestingly well-supported by our empirical results.

2 Learning Method

The learning method we develop here applies to an agent trying to learn a function with a continuous domain. We situate the method in the game of robotic soccer.

We begin each trial by placing a ball and a stationary car acting as the "teammate" in specific places on the field. Then we place another car, the "defender," in front of the goal. The defender moves in a small circle in front of the goal at some speed and begins at some random point along this circle. The learning agent must take one of two possible actions: shoot straight towards the goal, or pass to the teammate so that the ball will rebound towards the goal. A snapshot of the experimental setup is shown graphically in Figure 1(b).

The task is essentially to learn two functions, each with one continuous input variable, namely the defender's position. Based on this position, which can be represented unambiguously as the angle at which the defender is facing, φ, the agent tries to learn the probability of scoring when shooting, P_s*(φ), and the probability of scoring when passing, P_p*(φ).¹ If these functions were learned completely, which would only be possible if the defender's motion were deterministic, then both functions would be binary partitions: P_s*, P_p* : [0.0, 360.0) → {−1, 1}.² That is, the agent would know without doubt for any given φ whether a shot, a pass, both, or neither would achieve its goal. However, since the agent cannot have had experience for every possible φ, and since the defender may not move at the same speed each time, the learned functions must be approximations: P_s, P_p : [0.0, 360.0) → [−1.0, 1.0].

¹ As per convention, P* represents the target (optimal) function.
² Although we think of P_s* and P_p* as functions from angles to probabilities, we use −1 rather than 0 as the lower bound of the range. This representation simplifies many of our illustrative calculations.

In order to enable the agent to learn approximations to the functions P_s* and P_p*, we gave it a memory in which it could store its experiences and from which it could retrieve its current approximations P_s(φ) and P_p(φ).
We explored and developed appropriate methods of storing to and retrieving from memory, and an algorithm for deciding what action to take based on the retrieved values.

2.1 Memory Model

Storing every individual experience in memory would be inefficient both in terms of the amount of memory required and in terms of generalization time. Therefore, we store P_s and P_p only at discrete, evenly-spaced values of φ. That is, for a memory of size M (with M dividing evenly into 360 for simplicity), we keep values of P_p(θ) and P_s(θ) for θ ∈ {360n/M | 0 ≤ n < M}. We store memory as an array "Mem" of size M such that Mem[n] has values for both P_p(360n/M) and P_s(360n/M). Using a fixed memory size precludes using memory-based techniques such as k-Nearest-Neighbors (kNN) and kernel regression, which require that every experience be stored, choosing the most relevant only at decision time. Most of our experiments were conducted with memories of size 360 (low generalization) or of size 18 (high generalization), i.e., M = 360 or M = 18. The memory size had a large effect on the rate of learning [Stone and Veloso, 1995a].

2.1.1 Storing to Memory

With M discrete memory storage slots, the problem then arises as to how a specific training example should be generalized. Training examples are represented here as E_{φ,a,r}, consisting of an angle φ, an action a, and a result r, where φ is the initial position of the defender, a is "s" or "p" for "shoot" or "pass," and r is "1" or "-1" for "goal" or "miss" respectively. For instance, E_{72.345,p,1} represents a pass resulting in a goal for which the defender started at position 72.345° on its circle.

Each experience with |φ − θ| < 360/M affects Mem[θ] in proportion to the distance |θ − φ|. In particular, Mem[θ] keeps running sums of the magnitudes of scaled results, Mem[θ].total-a-results, and of scaled positive results, Mem[θ].positive-a-results, affecting P_a(θ), where "a" stands for "s" or "p" as before. Then at any given time,

    P_a(θ) = −1 + 2 × (positive-a-results / total-a-results).

The "−1" is for the lower bound of our probability range, and the "2×" is to scale the result to this range. Call this our adaptive memory storage technique:

Adaptive Memory Storage of E_{φ,a,r} in Mem[θ]:
  • r' = r × (1 − |φ − θ| / (360/M))
  • Mem[θ].total-a-results += |r'|
  • If r' > 0 Then Mem[θ].positive-a-results += r'
  • P_a(θ) = −1 + 2 × (positive-a-results / total-a-results)

For example, E_{110,p,1} would set both total-p-results and positive-p-results for Mem[120] (and Mem[100]) to 0.5, and consequently P_p(120) (and P_p(100)) to 1.0. But then E_{125,p,−1} would increment total-p-results for Mem[120] by 0.75, while leaving positive-p-results unchanged. Thus P_p(120) becomes −1 + 2 × (0.5/1.25) = −0.2.

This method of storing to memory is effective both for time-varying concepts and for concepts involving random noise. It is able to deal with conflicting examples within the range of the same memory slot.
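To make the storage technique concrete, here is a minimal Python sketch of the adaptive memory update, checked against the worked example above. The class and method names (AdaptiveMemory, store, p) are ours, not the paper's, and we assume angles wrap circularly around [0, 360), as the defender's circular position suggests.

```python
import math

class AdaptiveMemory:
    """Minimal sketch of the adaptive memory storage technique (our naming;
    the paper specifies the update rules, not an implementation)."""

    def __init__(self, m=18):
        self.m = m                  # number of discrete slots (e.g. 18 or 360)
        self.width = 360.0 / m      # angular spacing between slots
        # Per (slot, action) running sums: total |r'| and positive r'.
        self.total = {(n, a): 0.0 for n in range(m) for a in ("s", "p")}
        self.positive = {(n, a): 0.0 for n in range(m) for a in ("s", "p")}

    def store(self, phi, action, result):
        """Record example E_{phi, action, result}, result = +1 (goal) or -1
        (miss). The example affects the two nearest slots, scaled by the
        angular distance to each."""
        base = math.floor(phi / self.width)
        for n in (base % self.m, (base + 1) % self.m):
            theta = n * self.width
            dist = min(abs(phi - theta), 360.0 - abs(phi - theta))  # circular
            r_scaled = result * (1.0 - dist / self.width)
            self.total[(n, action)] += abs(r_scaled)
            if r_scaled > 0.0:
                self.positive[(n, action)] += r_scaled

    def p(self, n, action):
        """Current estimate P_a(theta_n) in [-1, 1]; 0 encodes 'no evidence'."""
        t = self.total[(n, action)]
        return 0.0 if t == 0.0 else -1.0 + 2.0 * self.positive[(n, action)] / t
```

Running the worked example through this sketch reproduces the numbers in the text:

```python
mem = AdaptiveMemory(m=18)     # slots every 20 degrees
mem.store(110.0, "p", +1)      # E_{110,p,1}: P_p(120) and P_p(100) become 1.0
mem.store(125.0, "p", -1)      # E_{125,p,-1}: total-p-results at 120 grows by 0.75
print(mem.p(6, "p"))           # slot at theta = 120: approximately -0.2
```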
Notice that each example influences two different memory locations. This memory storage technique is similar to the kNN and kernel regression function approximation techniques, which estimate f(φ) based on f(θ), possibly scaled by the distance from θ to φ, for the k nearest values of θ. In our linear continuum of defender positions, our memory generalizes training examples to the two nearest memory locations.³

2.1.2 Retrieving from Memory

Since individual training examples affect multiple memory locations, we use a simple technique for retrieving P_a(φ) from memory when deciding whether to shoot or to pass. We round φ to the nearest θ for which Mem[θ] is defined, and then take P_a(θ) as the value of P_a(φ). Thus, each Mem[θ] represents P_a(φ) for θ − 360/(2M) ≤ φ < θ + 360/(2M). Notice that retrieval is much simpler when using this technique than when using kNN or kernel regression: we look directly to the closest fixed memory position, thus eliminating the indexing and weighting problems involved in finding the k closest training examples and (possibly) scaling their results.

2.2 Choosing an Action

The action selection method is designed to make use of memory to select the action most likely to succeed, and to fill memory when no useful memories are available. When the defender is at position φ, the agent begins by retrieving P_p(φ) and P_s(φ) as described above. Then it acts according to the following rule:

    If P_p(φ) = P_s(φ) (no basis for a decision), shoot or pass randomly.
    else If P_p(φ) > 0 and P_p(φ) > P_s(φ), pass.
    else If P_s(φ) > 0 and P_s(φ) > P_p(φ), shoot.
    else If P_p(φ) = 0 (no previous passes), pass.
    else If P_s(φ) = 0 (no previous shots), shoot.
    else (P_p(φ), P_s(φ) < 0), shoot or pass randomly.

An action is only selected based on the memory values if these values indicate that one action is likely to succeed and that it is better than the other. If, on the other hand, neither value P_p(φ) nor P_s(φ) indicates a positive likelihood of success, then an action is chosen randomly. The only exception to this last rule is when one of the values is zero,⁴ suggesting that there have not yet been any training examples for that action at that memory location. In this case, there is a bias towards exploring the untried action in order to fill out memory.

³ For particularly large values of M it is useful to generalize training examples to more memory locations, particularly at the early stages of learning. However, for the values of M considered in this paper, we always generalize to the two nearest memory locations.
⁴ Recall that a memory value of 0 is equivalent to a probability of .5, representing no reason to believe that the action will succeed or fail.
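Continuing the AdaptiveMemory sketch above, retrieval and action selection might look as follows in Python. The function names retrieve and choose_action are ours; the decision logic is the rule listed above, taken literally.

```python
import random

def retrieve(mem, phi, action):
    """P_a(phi): the value stored at the memory slot nearest to phi
    (theta - 360/(2M) <= phi < theta + 360/(2M) rounds to theta)."""
    n = int(phi / mem.width + 0.5) % mem.m   # round half up, then wrap
    return mem.p(n, action)

def choose_action(mem, phi):
    """Pick 's' (shoot) or 'p' (pass) from the retrieved estimates."""
    pp = retrieve(mem, phi, "p")
    ps = retrieve(mem, phi, "s")
    if pp == ps:                       # no basis for a decision
        return random.choice(["s", "p"])
    if pp > 0.0 and pp > ps:           # passing likely to succeed, and better
        return "p"
    if ps > 0.0 and ps > pp:           # shooting likely to succeed, and better
        return "s"
    if pp == 0.0:                      # no previous passes: explore passing
        return "p"
    if ps == 0.0:                      # no previous shots: explore shooting
        return "s"
    return random.choice(["s", "p"])   # both look bad: choose randomly
```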
3 Experiments and Results

In this section, we present the results of our experiments. We explore our agent's ability to learn time-varying and nondeterministic defender behavior.

While examining the results, keep in mind that even if the agent used the functions P_s* and P_p* to decide whether to shoot or to pass, the success rate would be significantly less than 100% (and it would differ for different defender speeds): there were many defender starting positions for which neither shooting nor passing led to a goal (see Figure 2).

Figure 2: For different defender starting positions (solid rectangle), the agent can score when (a) shooting, (b) passing, (c) neither, or (d) both.

For example, from our experiments with the defender moving at a constant speed of 50,⁵ we found that an agent acting optimally scores 73.6% of the time; an agent acting randomly scores only 41.3% of the time. These values set good reference points for evaluating our learning agent's performance.

⁵ In the simulator, "50" represents 50 cm/s. Subsequently, we omit the units.

3.1 Coping with Changing Concepts

Figure 3 demonstrates the effectiveness of adaptive memory when the defender's speed changes. In all of the experiments represented in these graphs, the agent started with a memory trained by attempting a single pass and a single shot with the defender starting at each position θ for which Mem[θ] is defined and moving in its circle at speed 50. We tested the agent's performance with the defender moving at various (constant) speeds.

Figure 3: Success rate vs. defender speed [two panels: memory size M = 360 and M = 18; curves show the first 1000 trials, the next 1000 trials, and the theoretical optimum]. For all trials shown in these graphs, the agent began with a memory trained for a defender moving at constant speed 50.

With adaptive memory, the agent is able to unlearn the training that no longer applies and approach optimal behavior: it re-learns the new setup. During the first 1000 trials the agent suffers from having practiced in a different situation (especially for the less generalized memory, M = 360), but then it is able to approach optimal behavior over the next 1000 trials. Remember that optimal behavior, represented in the graph, leads to roughly a 70% success rate, since at many starting positions neither passing nor shooting is successful.

From these results we conclude that our adaptive memory can effectively deal with time-varying concepts. It can also perform well when the defender's motion is nondeterministic, as we show next.

3.2 Coping with Noise

To model nondeterministic motion by the defender, we set the defender's speed randomly within a range. For each attempt this speed is constant, but it varies from attempt to attempt. Since the agent observes only the defender's initial position, from the point of view of the agent the defender's motion is nondeterministic.

This set of experiments was designed to test the effectiveness of adaptive memory when the defender's speed was both nondeterministic and different from the speed used to train the existing memory. The memory was initialized in the same way as in Section 3.1 (for defender speed 50). We ran experiments in which the defender's speed varied between 10 and 50; the sketch below illustrates this procedure.
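For illustration only, here is a hedged Python sketch of this experimental procedure, reusing AdaptiveMemory and choose_action from the sketches above. The function attempt(action, phi, speed) is a hypothetical stand-in for the soccer simulator (not part of the paper): it must return +1 for a goal and −1 for a miss.

```python
import random

def pretrain(attempt, m=18, speed=50.0):
    """Initialize memory as in Section 3.1: one shot and one pass attempted
    with the defender starting at each position theta for which Mem[theta]
    is defined, moving at a constant speed."""
    mem = AdaptiveMemory(m=m)
    for n in range(m):
        theta = n * (360.0 / m)
        for action in ("s", "p"):
            mem.store(theta, action, attempt(action, theta, speed))
    return mem

def run_trials(attempt, mem, n_trials=500, speed_range=(10.0, 50.0)):
    """Run trials with a nondeterministic defender: the speed is drawn anew
    for each attempt but held constant within it. Learning continues
    incrementally online. Returns the overall goal percentage."""
    goals = 0
    for _ in range(n_trials):
        phi = random.uniform(0.0, 360.0)      # defender's start position
        speed = random.uniform(*speed_range)  # constant within this attempt
        action = choose_action(mem, phi)
        result = attempt(action, phi, speed)
        mem.store(phi, action, result)        # online, incremental update
        if result > 0:
            goals += 1
    return 100.0 * goals / n_trials

# Hypothetical usage, given a simulator binding `attempt`:
#   rate_scratch = run_trials(attempt, AdaptiveMemory(m=18))
#   rate_trained = run_trials(attempt, pretrain(attempt))
```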
We compared an agent with trained memory against an agent with an initially empty memory, as shown in Figure 4.

Figure 4: A comparison of the effectiveness of starting with an empty memory versus starting with a memory trained for a constant defender speed (50) different from that used during testing [one panel: success rate vs. trial number, M = 18, defender speed 10-50; curves show no initial memory and full initial memory]. Success rate is measured as goal percentage thus far.

The agent with full initial memory outperformed the agent with initially empty memory in the short run. The agent learning from scratch did better over time, since it did not have any training examples from when the defender was moving at a fixed speed of 50; but at first, the training examples for speed 50 were better than no training examples. Thus, when an agent needs to be successful immediately upon entering a novel setting, adaptive memory allows training in related situations to be effective without permanently reducing learning capacity.

4 Conclusion

Our experiments demonstrated that online, incremental, supervised learning can be effective at learning functions with continuous domains. We found that adaptive memory made it possible to learn both time-varying and nondeterministic concepts. We empirically demonstrated that short-term performance was better when acting with a memory trained on a concept related to, but different from, the testing concept than when starting from scratch. This paper reports experimental results on our work towards multiple learning agents, both cooperative and adversarial, in a continuous environment.

Future work on our research agenda includes simultaneous learning of the defender and the controlling agent in an adversarial context. We will also explore learning methods with several agents where teams are guided by planning strategies. In this way we will simultaneously study cooperative and adversarial situations using reactive and deliberative reasoning.

Acknowledgements

We thank Justin Boyan and the anonymous reviewers for their helpful suggestions. This research is sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant number F33615-93-1-1330. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Wright Laboratory or the U.S. Government.

References

[Aha and Salzberg, 1994] David W. Aha and Steven L. Salzberg. Learning to catch: Applying nearest neighbor algorithms to dynamic control tasks. In P. Cheeseman and R. W. Oldford, editors, Selecting Models from Data: Artificial Intelligence and Statistics IV. Springer-Verlag, New York, NY, 1994.

[Kanazawa, 1994] Keiji Kanazawa. Sensible decisions: Toward a theory of decision-theoretic information invariants. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 973-978, 1994.

[Kuh et al., 1991] A. Kuh, T. Petsche, and R. L. Rivest. Learning time-varying concepts. In Advances in Neural Information Processing Systems 3, pages 183-189.
Morgan Kaufmann, December 1991.

[Moore, 1991] A. W. Moore. Fast, robust adaptive control by learning only forward models. In Advances in Neural Information Processing Systems 3. Morgan Kaufmann, December 1991.

[Sahota et al., 1995] Michael K. Sahota, Alan K. Mackworth, Rod A. Barman, and Stewart J. Kingdon. Real-time control of soccer-playing robots using off-board vision: the dynamite testbed. In IEEE International Conference on Systems, Man, and Cybernetics, pages 3690-3663, 1995.

[Sahota, 1993] Michael K. Sahota. Real-time intelligent behaviour in dynamic environments: Soccer-playing robots. Master's thesis, University of British Columbia, August 1993.

[Salganicoff, 1993] Marcos Salganicoff. Density-adaptive learning and forgetting. In Proceedings of the Tenth International Conference on Machine Learning, pages 276-283, 1993.

[Schlimmer and Granger, 1986] J. C. Schlimmer and R. H. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 502-507. Morgan Kaufmann, Philadelphia, PA, 1986.

[Stone and Veloso, 1995a] Peter Stone and Manuela Veloso. Beating a defender in robotic soccer: Memory-based learning of a continuous function. Technical Report CMU-CS-95-222, Computer Science Department, Carnegie Mellon University, 1995.

[Stone and Veloso, 1995b] Peter Stone and Manuela Veloso. Broad learning from narrow training: A case study in robotic soccer. Technical Report CMU-CS-95-207, Computer Science Department, Carnegie Mellon University, 1995.

[Sutton and Whitehead, 1993] Richard S. Sutton and Steven D. Whitehead. Online learning with random representations. In Proceedings of the Tenth International Conference on Machine Learning, pages 314-321, 1993.

[Wettschereck and Dietterich, 1994] Dietrich Wettschereck and Thomas Dietterich. Locally adaptive nearest neighbor algorithms. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 184-191, San Mateo, CA, 1994. Morgan Kaufmann.

[Winstead and Christiansen, 1994] Nathaniel S. Winstead and Alan D. Christiansen. Pinball: Planning and learning in a dynamic real-time environment. In AAAI-94 Fall Symposium on Control of the Physical World by Intelligent Agents, pages 153-157, New Orleans, LA, November 1994.