{"title": "Learning Compositional Neural Programs with Recursive Tree Search and Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 14673, "page_last": 14683, "abstract": "We propose a novel reinforcement learning algorithm, AlphaNPI, that incorpo-\nrates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI\ncontributes structural biases in the form of modularity, hierarchy and recursion,\nwhich are helpful to reduce sample complexity, improve generalization and in-\ncrease interpretability. AlphaZero contributes powerful neural network guided\nsearch algorithms, which we augment with recursion. AlphaNPI only assumes\na hierarchical program specification with sparse rewards: 1 when the program\nexecution satisfies the specification, and 0 otherwise. This specification enables\nus to overcome the need for strong supervision in the form of execution traces\nand consequently train NPI models effectively with reinforcement learning. The\nexperiments show that AlphaNPI can sort as well as previous strongly supervised\nNPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with\ntwo disks and is shown to generalize to puzzles with an arbitrary number of disks.\nThe experiments also show that when deploying our neural network policies, it is\nadvantageous to do planning with guided Monte Carlo tree search.", "full_text": "Learning Compositional Neural Programs\nwith Recursive Tree Search and Planning\n\nThomas Pierrot\n\nInstaDeep\n\nGuillaume Ligner\n\nInstaDeep\n\nt.pierrot@instadeep.com\n\ng.ligner@instadeep.com\n\nScott Reed\nDeepMind\n\nreedscot@google.com\n\nOlivier Sigaud\n\nSorbonne Universit\u00e9\n\nolivier.sigaud@upmc.fr\n\nNicolas Perrin\n\nCNRS, Sorbonne Universit\u00e9\n\nperrin@isir.upmc.fr\n\nAlexandre Laterre\n\nInstaDeep\n\na.laterre@instadeep.com\n\nDavid Kas\nInstaDeep\n\nd.kas@instadeep.com\n\nKarim Beguir\n\nInstaDeep\n\nkb@instadeep.com\n\nNando de Freitas\n\nDeepMind\n\nnandodefreitas@google.com\n\nAbstract\n\nWe propose a novel reinforcement learning algorithm, AlphaNPI, that incorpo-\nrates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI\ncontributes structural biases in the form of modularity, hierarchy and recursion,\nwhich are helpful to reduce sample complexity, improve generalization and in-\ncrease interpretability. AlphaZero contributes powerful neural network guided\nsearch algorithms, which we augment with recursion. AlphaNPI only assumes\na hierarchical program speci\ufb01cation with sparse rewards: 1 when the program\nexecution satis\ufb01es the speci\ufb01cation, and 0 otherwise. This speci\ufb01cation enables\nus to overcome the need for strong supervision in the form of execution traces\nand consequently train NPI models effectively with reinforcement learning. The\nexperiments show that AlphaNPI can sort as well as previous strongly supervised\nNPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with\ntwo disks and is shown to generalize to puzzles with an arbitrary number of disks.\nThe experiments also show that when deploying our neural network policies, it is\nadvantageous to do planning with guided Monte Carlo tree search.\n\n1\n\nIntroduction\n\nLearning a wide variety of skills, which can be reused and repurposed to learn more complex skills\nor to solve new problems, is one of the central challenges of arti\ufb01cial intelligence (AI). As argued in\nBengio et al. 
[2019], beyond achieving good generalization when both the training and test data come\nfrom the same distribution, we want knowledge acquired in one setting to transfer to other settings\nwith different but possibly related distributions.\nModularity is a powerful inductive bias for achieving this goal with neural networks [Parascandolo\net al., 2018, Bengio et al., 2019]. Here, we focus on a particular modular representation known as\nNeural Programmer-Interpreters (NPI) [Reed and de Freitas, 2016]. The NPI architecture consists\nof a library of learned program embeddings that can be recomposed to solve different tasks, a core\nrecurrent neural network that learns to interpret arbitrary programs, and domain-speci\ufb01c encoders\nfor different environments. NPI achieves impressive multi-task results, with strong improvements in\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fgeneralization and reductions in sample complexity. While \ufb01xing the interpreter module, Reed and\nde Freitas [2016] also showed that NPI can learn new programs by re-using existing ones.\nThe NPI architecture can also learn recursive programs. In particular, Cai et al. [2017] demonstrates\nthat it is possible to take advantage of recursion to obtain theoretical guarantees on the generalization\nbehaviour of recursive NPIs. Recursive NPIs are thus amenable to veri\ufb01cation and easy interpretation.\nThe NPI approach at \ufb01rst appears to be very general because as noted in [Reed and de Freitas, 2016],\nprograms appear in many guises in AI; for example, as image transformations, as structured control\npolicies, as classical algorithms, and as symbolic relations. However, NPI suffers from one important\nlimitation: It requires supervised training from execution traces. This is a much stronger demand for\nsupervision than input-output pairs. Thus the practical interest has been limited.\nSome works have attempted to relax this strong supervision assumption. Li et al. [2017] and Fox\net al. [2018] train variations of NPI using mostly low-level demonstration trajectories but still require\na few full execution traces. Indeed, Fox et al. [2018] states \u201cOur results suggest that adding weakly\nsupervised demonstrations to the training set can improve performance at the task, but only when the\nstrongly supervised demonstrations already get decent performance\u201d.\nXiao et al. [2018] incorporate combinatorial abstraction techniques from functional programming into\nNPI. They report no dif\ufb01culties when learning using strong supervision, but substantial dif\ufb01culties\nwhen attempting to learn NPI models with curricula and REINFORCE. In fact, this policy gradient\nreinforcement learning (RL) algorithm fails to learn simple recursive NPIs, attesting to the dif\ufb01culty\nof applying RL to learn NPI models.\nThis paper demonstrates how to train NPI models effectively with RL for the \ufb01rst time. We remove\nthe need for execution traces in exchange for a speci\ufb01cation of programs and associated correctness\ntests on whether each program has completed successfully. This allows us to train the agent by telling\nit what needs to be done, instead of how it should be done. 
In other words, we show it is possible to\novercome the need for strong supervision by replacing execution traces with a library of programs we\nwant to learn and corresponding tests that assess whether a program has executed correctly.\nThe user specifying to the agent what to do, and not how to do it is reminiscent of programmable\nagents Denil et al. [2017] and declarative vs imperative programming. In our case, the user may also\nde\ufb01ne a hierarchy in the program speci\ufb01cation indicating which programs can be called by another.\nThe RL problem at-hand has a combinatorial nature, making it exceptionally hard to solve. Fortunately,\nwe have witnessed signi\ufb01cant progress in this area with the recent success of AlphaZero [Silver et al.,\n2017] in the game of Go. In the single-agent setting, Laterre et al. [2018] have demonstrated the\npower of AlphaZero when solving combinatorial bin packing problems.\nIn this work, we reformulate the original NPI as an actor-critic network and endow the search process\nof AlphaZero with the ability to handle hierarchy and recursion. These modi\ufb01cations, in addition to\nother more subtle changes detailed in the paper and appendices, enable us to construct a powerful RL\nagent, named AlphaNPI, that is able to train NPI models by RL1.\nAlphaNPI is shown to match the performance of strongly supervised versions of NPI in the ex-\nperiments. The experiments also shed light on the issue of deploying neural network RL policies.\nSpeci\ufb01cally, we \ufb01nd that agents that harness Monte Carlo tree search (MCTS) planning at test time\nare more effective than plain neural network policies.\n\n2 Problem statement and de\ufb01nitions\n\nWe consider an agent interacting with an environment, choosing actions a and making observations e.\nAn example of this is bubble sort, where the environment is represented as a list of numbers, and the\ninitial actions are one-step pointer moves and element swaps. We call this initial set of actions atomic\nactions. As training progresses, the agent learns to pro\ufb01t from atomic actions to acquire higher-level\nprograms. Once a program is learned, it is incorporated into the set of available actions. For example,\nin bubble sort, the agent may learn the program RESET, which moves all pointers to the beginning of\nthe list, and subsequently the agent may harness the program RESET as an action.\n\n1The code is available at https://github.com/instadeepai/AlphaNPI\n\n2\n\n\fIn our approach, a program has pre-conditions and post-conditions, which are tests on the environment\nstate. All pre-conditions must be satis\ufb01ed before execution. A program executes correctly if its\npost-conditions are veri\ufb01ed. For example, the pre-condition for bubble sort is that both pointers are\npositioned at the beginning of the list. The post-condition is a test indicating whether the list is sorted\nupon termination. A program terminates when it calls the atomic action STOP, which is assumed to\nbe available in all environments. A level is associated with each program, enabling us to de\ufb01ne a\nhierarchy: Atomic actions have level 0 and any other program has a positive level. In our work, a\nprogram can only call lower-level programs or itself.\nWe formulate learning a hierarchical library of programs as a multi-task RL problem. In this setting,\neach task corresponds to learning a single program. The action space consists of atomic actions and\nlearned programs. 
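To make this program specification concrete, a minimal sketch is shown below. The names and fields are hypothetical and do not correspond to the exact interface of the released code; the sketch only illustrates the information a user supplies for each program: a level, a pre-condition test and a post-condition test.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProgramSpec:
    name: str                               # e.g. "RESET", "BUBBLE", "BUBBLESORT"
    level: int                              # 0 for atomic actions, >= 1 otherwise
    precondition: Callable[[object], bool]  # test on the environment state, must hold before execution
    postcondition: Callable[[object], bool] # reward is 1 iff this holds when STOP is called

def callable_programs(caller: ProgramSpec, library: List[ProgramSpec]) -> List[ProgramSpec]:
    """A program may only call strictly lower-level programs or itself."""
    return [p for p in library if p.level < caller.level or p.name == caller.name]
```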
The reward signal is 1 if a program executes correctly, and 0 otherwise. The agent's goal is to maximize its expected reward over all the tasks. In other words, it has to learn all the programs in the input specification.

3 AlphaNPI

Our proposed agent, AlphaNPI, augments the NPI architecture of Reed and de Freitas [2016] to construct a recursive compositional neural network policy and a value function estimator, as illustrated in Figure 1. It also extends the MCTS procedure of Silver et al. [2017] to enable recursion.

Figure 1: AlphaNPI modular neural network architecture.

The AlphaNPI network architecture consists of five modules: state (or observation) encoders, a program embedding matrix, an LSTM [Hochreiter and Schmidhuber, 1997] interpreter, a policy (actor) network and a value network. Some of these modules are universal, and some are task-dependent. The architecture is consequently a natural fit for multi-task learning.

Programs are represented as vector embeddings p indexed by i in a library. As usual, we use an embedding matrix for this (M_prog). The observation encoder produces a vector of features s. The universal LSTM core interprets and executes arbitrary programs while conditioning on these features and its internal memory state h. The vector p corresponding to index i (stored in M_prog) is used by the LSTM core to know which program is being executed. A one-hot encoding of i could have been used instead, but the vector representation is more compact; furthermore, since the components of p are parameters of the network updated during training, their optimization can lead to generalization: intuitively, two programs with similar vector embeddings yield relatively similar action decisions. The policy network converts the LSTM output to a vector of probabilities π over the action space, while the value network uses this output to estimate the value function V. The architecture is summarized by the following equations:

s_t = f_enc(e_t),  p_t = M_prog[i_t, :],  h_t = f_lstm(s_t, p_t, h_{t-1}),  π_t = f_actor(h_t),  V_t = f_value(h_t).   (1)

The neural nets have parameters, but we omit them in our notation to simplify the presentation. These parameters and the program embeddings are learned simultaneously during training by RL.

Figure 2: Execution trace generation with AlphaNPI to solve the Tower of Hanoi puzzle. 1. To execute the i-th program, TOWEROFHANOI, AlphaNPI generates an execution trace (a_1, . . . , a_T), with observations (e_1, . . . , e_T) produced by the environment and actions a_t ∼ π_t^mcts produced by MCTS using the latest neural net, see Figure 3. When the action STOP is chosen, the program's post-conditions are evaluated to compute the final reward r. The tuples (e_t, i, h_t, π_t^mcts, r) are stored in a replay buffer. 2. The neural network parameters are updated to maximise the similarity of its policy vector output π to the search probabilities π^mcts, and to minimise the error between the predicted value V and the final reward r. To train the neural network, shown in Figure 1, we use the data in the replay buffer.

When this AlphaNPI network executes a program, it can either call a learned sub-program or itself (recursively), or perform an atomic action. When the atomic action is STOP, the program terminates and control is returned to the calling program using a stack.
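As a rough illustration of this control flow, the interpreter can be viewed as maintaining a stack of LSTM states, one per active program call. The sketch below uses hypothetical names (net.forward, env.apply, and so on) and is not the released implementation; it only makes the calling convention explicit.

```python
import numpy as np

def run_program(net, env, program, h, depth=0, max_depth=10):
    """Execute `program` until it emits STOP, then return control to the caller."""
    while True:
        obs = env.observe()
        # Equation (1): encode the observation, look up the program embedding,
        # advance the LSTM state and read out the policy over available actions.
        pi, value, h = net.forward(obs, program.index, h)
        action = net.choose(pi)             # in practice, chosen via MCTS (Figure 3)
        if action.name == "STOP":
            return                           # pop: control goes back to the caller
        if action.level == 0:
            env.apply(action)                # atomic action acts on the environment
        elif depth < max_depth:
            # Sub-program call: the caller's h stays on the call stack,
            # while the callee starts from a fresh zero LSTM state.
            run_program(net, env, action, np.zeros_like(h), depth + 1, max_depth)
```

Here the Python recursion mirrors the explicit stack: when the recursive call returns, the loop resumes with the caller's own LSTM state.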
When a sub-program is called, the stack depth increases and the LSTM memory state h is set to a vector of zeroes. This turns out to be very important for verifying the model [Cai et al., 2017].

To generate data to train the AlphaNPI network by RL, we introduce a variant of AlphaZero using recursive MCTS. The general training procedure is illustrated in Figure 2, which is inspired by Figure 1 of Silver et al. [2017], but for a single agent with hierarchical structure in this case. The Monte Carlo tree search (MCTS) guided by the AlphaNPI network enables the agent to "imagine" likely future scenarios and hence output an improved policy π^mcts, from which the next action is chosen (a detailed description of AlphaNPI is provided in Appendix A). This is repeated throughout the episode until the agent outputs the termination command STOP. If the program's post-conditions are satisfied, the agent obtains a final reward of 1, and 0 otherwise.

The data generated during these episodes is in turn used to retrain the AlphaNPI network. In particular, we record the sequence of observations, tree policies, LSTM internal states and rewards. We store the experience tuples (e, i, h, π^mcts, r) in a replay buffer. The data in this replay buffer is used to train the AlphaNPI network, as illustrated in Figure 2.

Figure 3: Monte-Carlo tree search with AlphaNPI for the Tower of Hanoi puzzle. 1. Each simulation traverses the tree by finding the actions that maximize the sum of the action value Q, an upper confidence bound U and a term L that encourages programs to call programs near the same level. 2. When the selected program is not atomic and the node has never been visited before, a new sub-tree is constructed. In the sub-tree, the LSTM internal state is initialized to zero. When the sub-tree search terminates, the LSTM internal state is reset to its previous calling state. 3. The leaf node is expanded and the associated observation e and program index i are evaluated by the AlphaNPI network to compute action probabilities P = π and values V. 4. The quantities Q and U are computed using the network predictions. 5. Once the search is complete, the tree policy vector π^mcts is returned. The next program in the execution trace is chosen according to π^mcts, until the program STOP is chosen or a computational budget is exceeded.

The search approach is depicted in Figure 3 for a Tower of Hanoi example; see also the corresponding Figure 2 of Silver et al. [2017]. A detailed description of the search process, including pseudo-code, appears in Appendix A. In what follows, we present an overview of this component of AlphaNPI.

For a specific program indexed by i, a node in the search tree corresponds to an observation e and an edge corresponds to an action a. As in AlphaZero, the neural network outputs the action probabilities and node values. These values are used, in conjunction with visit counts, to compute upper confidence bounds U and action-value functions Q during search. Unlike AlphaZero, we add a term L in the node selection stage to discourage programs from calling programs at a much lower level. In addition, we use a different estimate of the action-value function that better matches the environments considered in this paper.
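Concretely, the node-selection step can be sketched as follows. The exact forms of U, L and of our action-value estimate are given in Appendix A; the expressions and attribute names below are illustrative placeholders only.

```python
import math

def select_child(node, c_puct=1.0, level_weight=1.0):
    """Pick the child edge maximizing Q + U + L (illustrative forms of U and L)."""
    total_visits = sum(child.visit_count for child in node.children)
    best_child, best_score = None, -float("inf")
    for child in node.children:
        # Q: mean value of the simulations that passed through this edge.
        q = child.value_sum / child.visit_count if child.visit_count > 0 else 0.0
        # U: PUCT-style exploration bonus driven by the network prior P.
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        # L: bonus for calling programs whose level is close to the current one.
        l = -level_weight * abs(node.program_level - child.program_level)
        score = q + u + l
        if score > best_score:
            best_child, best_score = child, score
    return best_child
```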
Actions are thus selected by maximizing Q + U + L.

Also unlike AlphaZero, if the selected action is not atomic but an already learned program, we recursively build a new Monte Carlo tree for that program. To select a trajectory in the tree, that is, the program's imagined execution trace, we play n_simu simulations and record the number of visits to each node. This enables us to compute a tree policy vector π^mcts for each node, as detailed in Appendix A.5, which favours actions that have been most selected during the simulations.

The major feature of AlphaNPI is its ability to recursively construct a new tree during the search to execute an already learned program. This approach enables the agent to use learned skills as if they were atomic actions. When a tree is initialized to execute a new program, the LSTM internal state is initialized to zero and the environment reward signal changes to reflect the specification of the new program. The root node of the new tree corresponds to the current state of the environment. When the search process terminates, we check that the final environment state satisfies the program's post-conditions. If unsatisfied, we discard the full execution trace and start again. When returning control to an upper-level program, we assign to the LSTM the previous internal state for that level and continue the search process.

We found that discarding execution traces for programs executed incorrectly is necessary to achieve stable training. Indeed, the algorithm might choose the correct sequence of actions but still fail because one of the chosen sub-programs did not execute correctly. At the level we are trying to learn, possibly no mistake has been made, so it is wise to discard this data for training stability.

Finally, we use AlphaNPI MCTS in two different modes. In exploration mode, we use a high budget of simulations, final actions are taken by sampling according to the tree policy vectors, and we add Dirichlet noise to the network priors for better exploration. This mode is used during training. In exploitation mode, we use a low budget of simulations, final actions are taken as the argmax of the tree policy vectors, and we do not add noise to the priors. In this mode, AlphaNPI's behavior is deterministic. This mode is used during validation and test.

3.1 Training procedure

During a training iteration, the agent selects a program i to learn. It plays n_ep episodes (see Appendix E for specific values) using the tree search in exploration mode with a large budget of simulations. The generated experiences (e, i, h, π^mcts, r), where r is the episode final reward, are stored in a replay buffer. The agent is trained with the Adam optimizer on this data, so as to minimize the loss function:

ℓ = Σ_batch [ -(π^mcts)^T log π + (V - r)^2 ],   (2)

where the first term is the policy loss ℓ_policy and the second the value loss ℓ_value.

Note that the elements of a mini-batch may correspond to different tasks and are not necessarily adjacent in time. Given that the buffer memory is short, we make the assumption that the LSTM internal states have not changed too much. Thus, we do not use backpropagation through time to train the LSTM.
Standard backpropagation is used instead, which facilitates parallelization.

After each Adam update, we perform validation on all tasks for n_val episodes. The agent's average performance is recorded and used for curriculum learning, as discussed in the following subsection.

3.2 Curriculum learning

As with previous NPI models, curriculum learning plays an essential role. As programs are organized into levels, we begin by training the agent on programs of level 1 and then increase the level when the agent's performance is higher than a specific threshold. Our curriculum strategy is similar to the one by Andreas et al. [2017].

At each training iteration, the agent must choose the next program to learn. We initially assign equal probability to all level 1 programs and zero probability to all other programs. At each iteration, we update the probabilities according to the agent's validation performance. We increase the probability of programs on which the agent performed poorly and decrease the probabilities of those on which the agent performed well. We compute scores c_i = 1 - R_i for each program indexed by i, where R_i is a moving average of the reward accrued by this program during validation. The program selection probability is then defined as a softmax over these scores. When min_i R_i becomes greater than some threshold Δ_curr, we increase the maximum program level, thus allowing the agent to learn level 2 programs, and so on until it has learned every program.

4 Experiments

In the following experiments, we aim to assess the ability of our RL agent, AlphaNPI, to perform the sorting tasks studied by Reed and de Freitas [2016] and Cai et al. [2017]. We also consider a simple recursive Tower of Hanoi puzzle. An important question we would like to answer is: Can AlphaNPI, which is trained by RL only, perform as well as the iterative and recursive variants of NPI, which are trained with a strong supervisory signal consisting of full execution traces? Also, how essential is MCTS planning when deploying the neural network policies?

Table 1: Performance of AlphaNPI, trained on BUBBLESORT instances of length up to 7, on much longer input lists. For each BUBBLESORT variant, iterative and recursive, we deployed the trained AlphaNPI networks with and without MCTS planning. The results clearly highlight the importance of planning at deployment time.

Length | Iterative BUBBLESORT, net with planning | Iterative, net only | Recursive BUBBLESORT, net with planning | Recursive, net only
10     | 100% | 70% | 100% | 85%
20     | 100% | 60% | 100% | 85%
60     |  95% | 35% | 100% | 40%
100    |  40% | 10% | 100% | 10%

Table 2: Test performance on iterative sorting with no use of hierarchy. The AlphaNPI network is trained to sort using only atomic actions on lists of length up to 4, and tested on lists of length up to 6. The training time without hierarchy scales quadratically with list length, but only linearly with list length when a hierarchy is defined.

Length | Sorting without a hierarchy, net with planning | Net only
3      | 94% | 78%
4      | 42% | 22%
5      | 10% |  5%
6      |  1% |  1%

4.1 Sorting example

We consider an environment consisting of a list of n integers and two pointers referencing its elements. The agent can move both pointers and swap elements at the pointer positions. The goal is to learn a hierarchy of programs and to compose them to realize the BUBBLESORT algorithm.
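For concreteness, a heavily simplified sketch of this environment is given below. The class and method names are hypothetical and only illustrate the kind of atomic actions and pre/post-condition tests involved; the actual environment is defined in the released code.

```python
class SortingEnv:
    """Toy sorting environment: a list of integers and two pointers over it."""

    def __init__(self, values):
        self.values = list(values)
        self.p1 = 0
        self.p2 = 0

    # Atomic (level-0) actions: one-step pointer moves and an element swap.
    def ptr1_right(self):
        self.p1 = min(self.p1 + 1, len(self.values) - 1)

    def ptr2_right(self):
        self.p2 = min(self.p2 + 1, len(self.values) - 1)

    def swap(self):
        self.values[self.p1], self.values[self.p2] = (
            self.values[self.p2], self.values[self.p1])

    # Tests used as the sparse reward signal for the BUBBLESORT program.
    def bubblesort_precondition(self):
        return self.p1 == 0 and self.p2 == 0

    def bubblesort_postcondition(self):
        return all(a <= b for a, b in zip(self.values, self.values[1:]))
```

Higher-level programs such as RESET add their own pre- and post-condition tests on the pointer positions in the same way.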
The library of programs is summarized in Table 4 of the Appendix.

We trained AlphaNPI to learn the sorting library of programs on lists of length 2 to 7. Each iteration involves 20 episodes, so the agent can see up to 20 different training lists. As soon as the agent succeeds, training is stopped, so the agent typically sees fewer than 20 examples per iteration.

We validated on lists of length 7 and stopped when the minimum averaged validation reward, among all programs, reached Δ_curr. After training, we measured the generalization of AlphaNPI, in exploitation mode, on test lists of length 10 to 100, as shown in Table 1. For each length, we test on 40 randomly generated lists.

We observe that AlphaNPI can learn the iterative BUBBLESORT algorithm on lists of length up to 7 and generalize well to much longer lists. The original NPI, applied to iterative BUBBLESORT, had to be trained with strong supervision on lists of length 20 to achieve the same generalization. As reported by Cai et al. [2017], when training on arrays of length 2, the iterative NPI with strong supervision fails to generalize but the recursive NPI generalizes perfectly. However, when training the recursive NPI with policy-gradient RL and curricula, Xiao et al. [2018] report poor results.

To assess the contribution of adding a hierarchy to the model, we trained AlphaNPI with atomic actions only to learn iterative BUBBLESORT. As reported in Table 2, this ablation performs poorly in comparison to the hierarchical solutions.

We also defined a sorting environment in which the programs RESET, BUBBLE and BUBBLESORT are recursive. This setting corresponds to the "full recursive" case of Cai et al. [2017]. Being able to learn recursive programs requires adapting the environment. For instance, when a new task (recursive program) is started, the sorting environment becomes a sub-list of the original list. When the task terminates, the environment is reset to the previous list.

We trained the full recursive BUBBLESORT on lists of length 2 to 4 and validated on lists of length 7. After training, we assessed the generalization capabilities of the recursive AlphaNPI in Table 1. The results indicate that the recursive version outperforms the iterative one, confirming the results reported by Cai et al. [2017]. We also observe that AlphaNPI with planning is able to match the generalization performance of the recursive NPI with strong supervision, but that removing planning from deployment (i.e. using a network policy only) reduces performance.

Table 3: Test performance of one AlphaNPI trained agent on the recursive Tower of Hanoi puzzle.

Number of disks | MCTS | Network only
2               | 100% | 100%
5               | 100% | 100%
10              | 100% | 100%
12              | 100% | 100%

4.2 Tower of Hanoi puzzle

We trained AlphaNPI to solve the Tower of Hanoi puzzle recursively. Specifically, we consider an environment with 3 pillars and n disks of increasing size. Each pillar is given one of three roles: source, auxiliary or target. Initially, the n disks are placed on the source pillar. The goal is to move all disks to the target pillar, never placing a disk on a smaller one. It can be proven that the minimum number of moves is 2^n - 1, which results in a highly combinatorial problem.
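This bound follows from a one-line recurrence: writing T(n) for the minimum number of moves with n disks, the optimal strategy moves the top n - 1 disks to the auxiliary pillar, moves the largest disk to the target, and then moves the n - 1 disks back on top of it, so

T(n) = 2 T(n - 1) + 1,  with T(1) = 1,  which solves to T(n) = 2^n - 1.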
Moreover, the iterative\nsolution depends on the parity of the number of disks, which makes it very hard to learn a general\niterative solution with a neural network.\nTo solve this problem recursively, one must be able to call the TOWEROFHANOI program to move\nn \u2212 1 disks from the source pillar to the auxiliary pillar, then move the larger disk from the source\npillar to target pillar and \ufb01nally call again the TOWEROFHANOI program to move the n \u2212 1 pillars\nfrom the auxiliary pillar to the target.\nWe trained our algorithm to learn the recursive solution on problem instances with 2 disks, stopping\nwhen the minimum of the validation average rewards reached \u2206curr. Test results are shown in Table 3.\nAlphaNPI generalizes to instances with a greater number of disks.\nIn Appendix C, we show that once trained, an AlphaNPI agent can generalize to Tower of Hanoi\npuzzles with an arbitrary number of disks.\n\n5 Related work\n\nAlphaZero [Silver et al., 2017] used Monte Carlo Tree Search for planning and to derive a policy\nimprovement operator to train state-of-the-art neural network agents for playing Go, Chess and Shogi\nusing deep reinforcement learning. In [Laterre et al., 2018], AlphaZero is adapted to the setting of\none-player games applied to the combinatorial problem of bin packing. This work casts program\ninduction as a one player game and further adapts AlphaZero to incorporate compositional structure\ninto the learned programs.\nMany existing approaches to neural program induction do not explicitly learn programs in symbolic\nform, but rather implicitly in the network weights and then directly predict correct outputs given\nquery inputs. For example, the Neural GPU [Kaiser and Sutskever, 2015] can learn addition and\nmultiplication of binary numbers from examples. Neural module networks [Andreas et al., 2016] add\nmore structure by learning to stitch together differentiable neural network modules to solve question\nanswering tasks. Neural program meta induction [Devlin et al., 2017a] shows how to learn implicit\nneural programs in a few-shot learning setting.\nAnother class of neural program induction methods takes the opposite approach of explicitly syn-\nthesizing programs in symbolic form. DeepCoder [Balog et al., 2016] and RobustFill [Devlin et al.,\n2017b] learn in a supervised manner to generate programs for list and string manipulation using\ndomain speci\ufb01c languages. In [Evans and Grefenstette, 2018], explanatory rules are learned from\nnoisy data. Ellis et al. [2018] shows how to generate graphics programs to reproduce hand drawn\nimages. In [Sun et al., 2018], programs are generated from visual demonstrations. Chen et al. [2017]\nshows how to learn parsing programs from examples and their parse trees. Verma et al. [2018] shows\nhow to distill programmatically-interpretable agents from conventional Deep RL agents.\nSome approaches lie in between fully explicit and implicit, for example by making execution\ndifferentiable in order to learn parts of programs or to optimize programs [Bo\u0161njak et al., 2017,\nBunel et al., 2016, Gaunt et al., 2016]. In [Nye et al., 2019], an LSTM generator conditioned on\nspeci\ufb01cations is used to produce schematic outlines of programs, which are then fed to a simple\n\n8\n\n\flogical program synthesizer. Similarly, Shin et al. 
[2018] use LSTMs to map input-output pairs to\ntraces and subsequently map these traces to code.\nNeural Programmer-Interpreters [Reed and de Freitas, 2016], which we extend in this work, learn\nto execute a hierarchy of programs from demonstration. Cai et al. [2017] showed that by learning\nrecursive instead of iterative forms of algorithms like bubble sort, NPI can achieve perfect generaliza-\ntion from far fewer demonstrations. Here, perfect generalization means generalization with provable\ntheoretical guarantees. Neural Task Programming [Xu et al., 2018] adapted NPI to the setting of\nrobotics in order to learn manipulation behaviors from visual demonstrations and annotations of the\nprogram hierarchy.\nSeveral recent works have reduced the training data requirements of NPI, especially the \u201cstrong\nsupervision\u201d of demonstrations at each level of the program hierarchy. For example, Li et al. [2017]\nand Fox et al. [2018] show how to train variations of NPI using mostly low-level demonstration\ntrajectories and a relatively smaller proportion of hierarchical annotations compared to NPI. However,\ndemonstrations are still required. Xiao et al. [2018] incorporates combinator abstraction techniques\nfrom functional programming into NPI to improve training, but emphasize the dif\ufb01culty of learning\nsimple NPI models with RL algorithms.\nHierarchical reinforcement learning combined with deep neural networks has received increased\nattention in the past several years [Osa et al., 2019, Nachum et al., 2018b, Kulkarni et al., 2016,\nNachum et al., 2018a, Levy et al., 2018, Vezhnevets et al., 2017], mainly applied to ef\ufb01cient training\nof agents for Atari, navigation and continuous control. This work shares a similar motivation of using\nhierarchy to improve generalization and sample ef\ufb01ciency, but we focus on algorithmic problem\ndomains and learning potentially recursive neural programs without any demonstrations.\nWhile AlphaZero does not use hierarchies or recursion, hierarchical MCTS algorithms have been\npreviously proposed for simple hierarchical RL domains [Vien and Toussaint, 2015, Bai et al., 2016].\nThe current work capitalizes on advances brought in by deep reinforcement learning as well as design\nchoices particular to this paper to signi\ufb01cantly extend this research frontier.\nFinally, as demonstrated in the original NPI paper, the modular approach with context-dependent\ninput embeddings and a task independent interpreter is ideal for meta-learning and transfer. Recent\nmanifestations of this idea of using an embedding to re-program a core neural network to facilitate\nmeta-learning include Zintgraf et al. [2019] and Chen et al. [2019]. To the best of our knowledge the\nidea of programmable neural networks goes back several decades to the original Parallel Distributed\nProgramming (PDP) papers of Jay McClelland and colleagues. We leave transfer and meta-learning\nas a future explorations for AlphaNPI.\n\n6 Conclusion\n\nThis paper proposed and demonstrated the \ufb01rst effective RL agent for training NPI models: AlphaNPI.\nAlphaNPI extends NPI to the RL domain and enhances AlphaZero with the inductive biases of\nmodularity, hierarchy and recursion. AlphaNPI was shown to match the performance of strongly\nsupervised versions of NPI in the sorting experiment, and to generalize remarkably well in the Tower\nof Hanoi environment. The experiments also shed light on the issue of deploying neural network RL\npolicies. 
Speci\ufb01cally, we found out that agents that harness MCTS planning at test time are much\nmore effective than plain neural network policies.\nWhile our test domains are complex along some axes, e.g. recursive and combinatorial, they are\nsimple along others, e.g. the environment model is available. The natural next step is to consider\nenvironments, such as robot manipulation, where it is also important to learn perception modules and\nlibraries of skills in a modular way to achieve transfer to new tasks with few data. It will be fascinating\nto harness imperfect environment models in these environments and assess the performance of MCTS\nplanning when launching AlphaNPI policies.\n\n7 Acknowledgements\n\nWork by Nicolas Perrin was partially supported by the French National Research Agency (ANR),\nProject ANR-18-CE33-0005 HUSKI.\n\n9\n\n\fReferences\nJacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In IEEE Computer\n\nVision and Pattern Recognition, pages 39\u201348, 2016.\n\nJacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches.\n\nIn International Conference on Machine Learning, pages 166\u2013175, 2017.\n\nAijun Bai, Siddharth Srivastava, and Stuart Russell. Markovian state and action abstractions for MDPs via\n\nhierarchical MCTS. In International Joint Conference on Arti\ufb01cial Intelligence, pages 3029\u20133037, 2016.\n\nMatej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder:\n\nLearning to write programs. In International Conference on Learning Representations, 2016.\n\nYoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, S\u00e9bastien Lachapelle, Olexa Bilaniuk,\nAnirudh Goyal, and Christopher J. Pal. A meta-transfer objective for learning to disentangle causal mecha-\nnisms. arXiv preprint arXiv:1901.10912, 2019.\n\nMatko Bo\u0161njak, Tim Rockt\u00e4schel, Jason Naradowsky, and Sebastian Riedel. Programming with a differentiable\n\nforth interpreter. In International Conference on Machine Learning, pages 547\u2013556, 2017.\n\nRudy R Bunel, Alban Desmaison, Pawan K Mudigonda, Pushmeet Kohli, and Philip Torr. Adaptive neural\n\ncompilation. In Conference on Neural Information Processing Systems, pages 1444\u20131452, 2016.\n\nJonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion.\n\nIn International Conference on Learning Representations, 2017.\n\nXinyun Chen, Chang Liu, and Dawn Song. Towards synthesizing complex programs from input-output examples.\n\nIn International Conference on Learning Representations, 2017.\n\nYutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C.\nCobo, Andrew Trask, Ben Laurie, Caglar Gulcehre, Aaron van den Oord, Oriol Vinyals, and Nando de Freitas.\nSample ef\ufb01cient adaptive text-to-speech. In International Conference on Learning Representations, 2019.\n\nMisha Denil, Sergio Gomez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable\n\nagents. arXiv preprint arXiv:1706.06383, 2017.\n\nJacob Devlin, Rudy R Bunel, Rishabh Singh, Matthew Hausknecht, and Pushmeet Kohli. Neural program\n\nmeta-induction. In Conference on Neural Information Processing Systems, pages 2080\u20132088, 2017a.\n\nJacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet\nKohli. RobustFill: Neural program learning under noisy I/O. 
In International Conference on Machine\nLearning, pages 990\u2013998, 2017b.\n\nAshley D Edwards, Laura Downs, and James C Davidson. Forward-backward reinforcement learning. arXiv\n\npreprint arXiv:1803.10227, 2018.\n\nKevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics programs\nfrom hand-drawn images. In Conference on Neural Information Processing Systems, pages 6059\u20136068, 2018.\n\nRichard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Arti\ufb01cial\n\nIntelligence Research, 61:1\u201364, 2018.\n\nRoy Fox, Richard Shin, Sanjay Krishnan, Ken Goldberg, Dawn Song, and Ion Stoica. Parametrized hierarchical\n\nprocedures for neural programming. In International Conference on Learning Representations, 2018.\n\nAlexander L Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor,\nand Daniel Tarlow. Terpret: A probabilistic programming language for program induction. arXiv preprint\narXiv:1608.04428, 2016.\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780,\n\n1997.\n\n\u0141ukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.\n\nTejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement\nlearning: Integrating temporal abstraction and intrinsic motivation. In Conference on Neural Information\nProcessing Systems, pages 3675\u20133683, 2016.\n\n10\n\n\fAlexandre Laterre, Yunguan Fu, Mohamed Khalil Jabri, Alain-Sam Cohen, David Kas, Karl Hajjar, Torbjorn S\nDahl, Amine Kerkeni, and Karim Beguir. Ranked reward: Enabling self-play reinforcement learning for\ncombinatorial optimization. arXiv preprint arXiv:1807.01672, 2018.\n\nAndrew Levy, Robert Platt, and Kate Saenko. Hierarchical reinforcement learning with hindsight. arXiv preprint\n\narXiv:1805.08180, 2018.\n\nChengtao Li, Daniel Tarlow, Alexander L Gaunt, Marc Brockschmidt, and Nate Kushman. Neural program\n\nlattices. In International Conference on Learning Representations, 2017.\n\nO\ufb01r Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Near-optimal representation learning for\n\nhierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018a.\n\nO\ufb01r Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-ef\ufb01cient hierarchical reinforcement\n\nlearning. In Conference on Neural Information Processing Systems, pages 3303\u20133313, 2018b.\n\nMaxwell I. Nye, Luke B. Hewitt, Joshua B. Tenenbaum, and Armando Solar-Lezama. Learning to infer program\n\nsketches. arXiv preprint arXiv:1902.06349, 2019.\n\nTakayuki Osa, Voot Tangkaratt, and Masashi Sugiyama. Hierarchical reinforcement learning via advantage-\n\nweighted information maximization. In International Conference on Learning Representations, 2019.\n\nGiambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Sch\u00f6lkopf. Learning independent\n\ncausal mechanisms. In International Conference on Machine Learning, pages 4036\u20134044, 2018.\n\nScott Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on Learning\n\nRepresentations, 2016.\n\nRichard Shin, Illia Polosukhin, and Dawn Song. Improving neural program synthesis with inferred execution\n\ntraces. 
In Conference on Neural Information Processing Systems, pages 8931\u20138940, 2018.\n\nDavid Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas\nHubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge.\nNature, 550(7676):354\u2013359, 2017.\n\nShao-Hua Sun, Hyeonwoo Noh, Sriram Somasundaram, and Joseph Lim. Neural program synthesis from diverse\n\ndemonstration videos. In International Conference on Machine Learning, pages 4797\u20134806, 2018.\n\nAbhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Program-\nmatically interpretable reinforcement learning. In International Conference on Machine Learning, pages\n5052\u20135061, 2018.\n\nAlexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and\nKoray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In International Conference\non Machine Learning, pages 3540\u20133549, 2017.\n\nNgo Anh Vien and Marc Toussaint. Hierarchical Monte-Carlo planning. In National Conference on Arti\ufb01cial\n\nIntelligence (AAAI), 2015.\n\nDa Xiao, Jo-Yu Liao, and Xingyuan Yuan. Improving the universality and learnability of neural programmer-\n\ninterpreters with combinator abstraction. arXiv preprint arXiv:1802.02696, 2018.\n\nDanfei Xu, Suraj Nair, Yuke Zhu, Julian Gao, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Neural task\nprogramming: Learning to generalize across hierarchical tasks. In IEEE International Conference on Robotics\n& Automation, pages 1\u20138, 2018.\n\nLuisa M. Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context\n\nadaptation via meta-learning. In International Conference on Machine Learning, 2019.\n\n11\n\n\f", "award": [], "sourceid": 8290, "authors": [{"given_name": "Thomas", "family_name": "PIERROT", "institution": "InstaDeep"}, {"given_name": "Guillaume", "family_name": "Ligner", "institution": "InstaDeep"}, {"given_name": "Scott", "family_name": "Reed", "institution": "Google DeepMind"}, {"given_name": "Olivier", "family_name": "Sigaud", "institution": "Sorbonne University"}, {"given_name": "Nicolas", "family_name": "Perrin", "institution": "ISIR, Sorbonne Universit\u00e9"}, {"given_name": "Alexandre", "family_name": "Laterre", "institution": "InstaDeep"}, {"given_name": "David", "family_name": "Kas", "institution": "InstaDeep"}, {"given_name": "Karim", "family_name": "Beguir", "institution": "InstaDeep"}, {"given_name": "Nando", "family_name": "de Freitas", "institution": "DeepMind"}]}