{"title": "Imitation-Projected Programmatic Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15752, "page_last": 15763, "abstract": "We study the problem of programmatic reinforcement learning, in which policies are represented as short programs in a symbolic language. Programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for such policies remains a challenge. Our approach to this challenge - a meta-algorithm called PROPEL - is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a form of mirror descent that takes a gradient step into the unconstrained policy space and then projects back onto the constrained space.  Second, we view the unconstrained policy space as mixing neural and programmatic representations, which enables employing state-of-the-art deep policy gradient approaches.  Third, we cast the projection step as program synthesis via imitation learning, and exploit contemporary combinatorial methods for this task. We present theoretical convergence results for PROPEL and empirically evaluate the approach in three continuous control domains. The experiments show that PROPEL can significantly outperform state-of-the-art approaches for learning programmatic policies.", "full_text": "Imitation-Projected Programmatic Reinforcement\n\nLearning\n\nAbhinav Verma\u2217\nRice University\n\naverma@rice.edu\n\nHoang M. Le\u2217\n\nCaltech\n\nhmle@caltech.edu\n\nYisong Yue\n\nCaltech\n\nyyue@caltech.edu\n\nSwarat Chaudhuri\n\nRice University\n\nswarat@rice.edu\n\nAbstract\n\nWe study the problem of programmatic reinforcement learning, in which policies\nare represented as short programs in a symbolic language. 
Programmatic poli-\ncies can be more interpretable, generalizable, and amenable to formal veri\ufb01cation\nthan neural policies; however, designing rigorous learning approaches for such\npolicies remains a challenge. Our approach to this challenge \u2014 a meta-algorithm\ncalled PROPEL\u2014 is based on three insights. First, we view our learning task as\noptimization in policy space, modulo the constraint that the desired policy has a\nprogrammatic representation, and solve this optimization problem using a form of\nmirror descent that takes a gradient step into the unconstrained policy space and\nthen projects back onto the constrained space. Second, we view the unconstrained\npolicy space as mixing neural and programmatic representations, which enables\nemploying state-of-the-art deep policy gradient approaches. Third, we cast the pro-\njection step as program synthesis via imitation learning, and exploit contemporary\ncombinatorial methods for this task. We present theoretical convergence results for\nPROPEL and empirically evaluate the approach in three continuous control domains.\nThe experiments show that PROPEL can signi\ufb01cantly outperform state-of-the-art\napproaches for learning programmatic policies.\n\nIntroduction\n\n1\nA growing body of work [58, 8, 60] investigates reinforcement learning (RL) approaches that represent\npolicies as programs in a symbolic language, e.g., a domain-speci\ufb01c language for composing control\nmodules such as PID controllers [5]. Short programmatic policies offer many advantages over neural\npolicies discovered through deep RL, including greater interpretability, better generalization to unseen\nenvironments, and greater amenability to formal veri\ufb01cation. These bene\ufb01ts motivate developing\neffective approaches for learning such programmatic policies.\nHowever, programmatic reinforcement learning (PRL) remains a challenging problem, owing to the\nhighly structured nature of the policy space. 
Recent state-of-the-art approaches employ program synthesis methods to imitate or distill a pre-trained neural policy into short programs [58, 8]. However, such a distillation process can yield a highly suboptimal programmatic policy (i.e., a large distillation gap), and the issue of direct policy search for programmatic policies also remains open.\nIn this paper, we develop PROPEL (Imitation-Projected Programmatic Reinforcement Learning), a new learning meta-algorithm for PRL, as a response to this challenge. The design of PROPEL is based on three insights that enable integrating and building upon state-of-the-art approaches for policy gradients and program synthesis. First, we view programmatic policy learning as a constrained policy optimization problem, in which the desired policies are constrained to be those that have a programmatic representation. This insight motivates utilizing constrained mirror descent approaches, which take a gradient step into the unconstrained policy space and then project back onto the constrained space. Second, by allowing the unconstrained policy space to have a mix of neural and programmatic representations, we can employ well-developed deep policy gradient approaches [55, 36, 47, 48, 19] to compute the unconstrained gradient step. Third, we define the projection operator using program synthesis via imitation learning [58, 8], in order to recover a programmatic policy from the unconstrained policy space. Our contributions can be summarized as:\n\u2022 We present PROPEL, a novel meta-algorithm that is based on mirror descent, program synthesis, and imitation learning, for PRL.\n\u2022 On the theoretical side, we show how to cast PROPEL as a form of constrained mirror descent. We provide a thorough theoretical analysis characterizing the impact of approximate gradients and projections. Further, we prove results that provide expected regret bounds and finite-sample guarantees under reasonable assumptions.\n\u2022 On the practical side, we provide a concrete instantiation of PROPEL and evaluate it in three continuous control domains, including the challenging car-racing domain TORCS [59]. The experiments show significant improvements over state-of-the-art approaches for learning programmatic policies.\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\u03c0(s) ::= a | Op(\u03c01(s), . . . , \u03c0k(s)) | if b then \u03c01(s) else \u03c02(s) | \u2295\u03b8(\u03c01(s), . . . , \u03c0k(s))\nb ::= \u03c6(s) | BOp(b1, . . . , bk)\n\nFigure 1: A high-level syntax for programmatic policies, inspired by [58]. A policy \u03c0(s) takes a state s as input and produces an action a as output. b represents boolean expressions; \u03c6 is a boolean-valued operator on states; Op is an operator that combines multiple policies into one policy; BOp is a standard boolean operator; and \u2295\u03b8 is a \u201clibrary function\u201d parameterized by \u03b8.\n\nif (s[TrackPos] < 0.011 and s[TrackPos] > \u22120.011)\n    then PID\u27e8RPM,0.45,3.54,0.03,53.39\u27e9(s) else PID\u27e8RPM,0.39,3.54,0.03,53.39\u27e9(s)\n\nFigure 2: A programmatic policy for acceleration in TORCS [59], automatically discovered by PROPEL. s[TrackPos] represents the most recent reading from sensor TrackPos.\n\n2 Problem Statement\nThe problem of programmatic reinforcement learning (PRL) consists of a Markov Decision Process (MDP) M and a programmatic policy class \u03a0. 
The definition of M = (S, A, P, c, p0, \u03b3) is standard [54], with S being the state space, A the action space, P(s'|s, a) the probability density function of transitioning from a state-action pair to a new state, c(s, a) the state-action cost function, p0(s) a distribution over starting states, and \u03b3 \u2208 (0, 1) the discount factor. A policy \u03c0 : S \u2192 A (stochastically) maps states to actions. We focus on continuous control problems, so S and A are assumed to be continuous spaces. The goal is to find a programmatic policy \u03c0\u2217 \u2208 \u03a0 such that:\n\n\u03c0\u2217 = argmin_{\u03c0 \u2208 \u03a0} J(\u03c0), where: J(\u03c0) = E[ \u2211_{i=0}^{\u221e} \u03b3^i c(s_i, a_i \u2261 \u03c0(s_i)) ],    (1)\n\nwith the expectation taken over the initial state distribution s0 \u223c p0, the policy decisions, and the transition dynamics P. One can also use rewards, in which case (1) becomes a maximization problem.\nProgrammatic Policy Class. A programmatic policy class \u03a0 consists of policies that can be represented parsimoniously by a (domain-specific) programming language. Recent work [58, 8, 60] indicates that such policies can be easier to interpret and formally verify than neural policies, and can also be more robust to changes in the environment.\nIn this paper, we consider two concrete classes of programmatic policies. The first, a simplification of the class considered in Verma et al. [58], is defined by the modular, high-level language in Figure 1. This language assumes a library of parameterized functions \u2295\u03b8 representing standard controllers, for instance Proportional-Integral-Derivative (PID) [6] or bang-bang controllers [11]. Programs in the language take states s as inputs and produce actions a as output, and can invoke fully instantiated library controllers along with predefined arithmetic, boolean and relational operators. 
The second, \u201clower-level\u201d class, from Bastani et al. [8], consists of decision trees that map states to actions.\nExample. Consider the problem of learning a programmatic policy, in the language of Figure 1, that controls a car\u2019s accelerator in the TORCS car-racing environment [59]. Figure 2 shows a program in our language for this task. The program invokes PID controllers PID\u27e8j,\u03b8P,\u03b8I,\u03b8D\u27e9, where j identifies the sensor (out of 29, in our experiments) that provides inputs to the controller, and \u03b8P, \u03b8I, and \u03b8D are respectively the real-valued coefficients of the proportional, integral, and derivative terms in the controller. We note that the program only uses the sensors TrackPos and RPM. While TrackPos (for the position of the car relative to the track axis) is used to decide which controller to use, only the RPM sensor is needed to calculate the acceleration.\n\nAlgorithm 1 Imitation-Projected Programmatic Reinforcement Learning (PROPEL)\n1: Input: Programmatic & Neural Policy Classes: \u03a0 & F.\n2: Input: Either initial \u03c00 or initial f0\n3: Define joint policy class: H \u2261 \u03a0 \u2295 F    //h \u2261 \u03c0 + f defined as h(s) = \u03c0(s) + f(s)\n4: if given initial f0 then\n5:     \u03c00 \u2190 PROJECT(f0)    //program synthesis via imitation learning\n6: end if\n7: for t = 1, . . . , T do\n8:     ht \u2190 UPDATEF(\u03c0t\u22121, \u03b7)    //policy gradient in neural policy space with learning rate \u03b7\n9:     \u03c0t \u2190 PROJECT\u03a0(ht)    //program synthesis via imitation learning\n10: end for\n11: Return: Policy \u03c0T\n\nLearning Challenges. Learning programmatic policies in the continuous RL setting is challenging, as the best performing methods utilize policy gradient approaches [55, 36, 47, 48, 19], but policy gradients are hard to compute in programmatic representations. In many cases, \u03a0 may not even be differentiable. 
For our approach, we only assume access to program synthesis methods that can select a programmatic policy \u03c0 \u2208 \u03a0 that minimizes imitation disagreement with demonstrations provided by a teaching oracle. Because imitation learning tends to be easier than general RL in long-horizon tasks [53], the task of imitating a neural policy with a program is, intuitively, significantly simpler than the full programmatic RL problem. This intuition is corroborated by past work on programmatic RL [58], which shows that direct search over programs often fails to meet basic performance objectives.\n\n3 Learning Algorithm\nTo develop our approach, we take the viewpoint of (1) being a constrained optimization problem, where \u03a0 \u2282 H resides within a larger space of policies H. In particular, we will represent H \u2261 \u03a0 \u2295 F using a mixing of programmatic policies \u03a0 and neural policies F. Any mixed policy h \u2261 \u03c0 + f can be invoked as h(s) = \u03c0(s) + f(s). In general, we assume that F is a good approximation of \u03a0 (i.e., for each \u03c0 \u2208 \u03a0 there is some f \u2208 F that approximates it well), which we formalize in Section 4.\nWe can now frame our constrained learning problem as minimizing (1) over \u03a0 \u2282 H, using updates that alternate between taking a gradient step in the general space H and projecting back down onto \u03a0. This \u201clift-and-project\u201d perspective motivates viewing our problem via the lens of mirror descent [40]. In standard mirror descent, the unconstrained gradient step can be written as h \u2190 h_prev \u2212 \u03b7\u2207_H J(h_prev) for step size \u03b7, and the projection can be written as \u03c0 \u2190 argmin_{\u03c0' \u2208 \u03a0} D(\u03c0', h) for divergence measure D.\nOur approach, Imitation-Projected Programmatic Reinforcement Learning (PROPEL), is outlined in Algorithm 1 (also see Figure 3). 
PROPEL is a meta-algorithm that requires instantiating two\nsubroutines, UPDATE and PROJECT, which correspond to the standard update and projection steps,\nrespectively. PROPEL can be viewed as a form of functional mirror descent with some notable\ndeviations from vanilla mirror descent.\nUPDATEF\nSince policy gradient methods are well-\ndeveloped for neural policy classes F (e.g., [36, 47, 48, 30,\n24, 19]) and non-existent for programmatic policy classes \u03a0,\nPROPEL is designed to leverage policy gradients in F and\navoid policy gradients in \u03a0. Algorithm 2 shows one instanti-\nation of UPDATEF . Note that standard mirror descent takes\nunconstrained gradient steps in H rather than F, and we\ndiscuss this discrepancy between UPDATEH and UPDATEF\nin Section 4.\nPROJECT\u03a0. Projecting onto \u03a0 can be implemented using\nprogram synthesis via imitation learning, i.e., by synthesiz-\ning a \u03c0 \u2208 \u03a0 to best imitate demonstrations provided by a\nteaching oracle h \u2208 H. 
Recent work [58, 8, 60] has given practical heuristics for this task for various programmatic policy classes. Algorithm 3 shows one instantiation of PROJECT\u03a0 (based on DAgger [46]). One complication that arises is that finite-sample runs of such imitation learning approaches only return approximate solutions and so the projection is not exact. We characterize the impact of approximate projections in Section 4.\n\nFigure 3: Depicting the PROPEL meta-algorithm. [Diagram labels: H, UPDATEF, PROJECT\u03a0, \u03a0.]\n\nAlgorithm 2 UPDATEF: neural policy gradient for mixed policies\n1: Input: Neural Policy Class F. Input: Reference programmatic policy: \u03c0\n2: Input: Step size: \u03b7. Input: Regularization parameter: \u03bb\n3: Initialize neural policy: f_0    //any standard randomized initialization\n4: for j = 1, . . . , m do\n5:     f_j \u2190 f_{j\u22121} \u2212 \u03b7\u03bb\u2207_F J(\u03c0 + \u03bbf_{j\u22121})    //using DDPG [36], TRPO [47], etc., holding \u03c0 fixed\n6: end for\n7: Return: h \u2261 \u03c0 + \u03bbf_m\n\nAlgorithm 3 PROJECT\u03a0: program synthesis via imitation learning\n1: Input: Programmatic Policy Class: \u03a0. Input: Oracle policy: h\n2: Roll-out h on environment, get trajectory: \u03c4_0 = (s_0, h(s_0), s_1, h(s_1), . . .)\n3: Create supervised demonstration set: \u0393_0 = {(s, h(s))} from \u03c4_0\n4: Derive \u03c0_0 from \u0393_0 via program synthesis    //e.g., using methods in [58, 8]\n5: for k = 1, . . . , M do\n6:     Roll-out \u03c0_{k\u22121}, creating trajectory: \u03c4_k\n7:     Collect demonstration data: \u0393' = {(s, h(s)) | s \u2208 \u03c4_k}\n8:     \u0393_k \u2190 \u0393' \u222a \u0393_{k\u22121}    //DAgger-style imitation learning [46]\n9:     Derive \u03c0_k from \u0393_k via program synthesis    //e.g., using methods in [58, 8]\n10: end for\n11: Return: \u03c0_M\n\nPractical Considerations. In practice, we often employ multiple gradient steps before taking a projection step (as also described in Algorithm 2), because the step size of individual (stochastic) gradient updates can be quite small. Another issue that arises in virtually all policy gradient approaches is that the gradient estimates can have very high variance [55, 33, 30]. 
We utilize low-variance policy gradient updates by using the reference \u03c0 as a proximal regularizer in function space [19].\nFor the projection step (Algorithm 3), in practice we often retain all previous roll-outs \u03c4 from all previous projection steps. It is straightforward to query the current oracle h to provide demonstrations on the states s \u2208 \u03c4 from previous roll-outs, which can lead to substantial savings in sample complexity with regard to executing roll-outs on the environment, while not harming convergence.\n\n4 Theoretical Analysis\n\nWe start by viewing PROPEL through the lens of online learning in function space, independent of the specific parametric representation. This starting point yields a convergence analysis of Alg. 1 in Section 4.1 under generic approximation errors. We then analyze the issues of policy class representation in Sections 4.2 and 4.3, and connect Algorithms 2 and 3 with the overall performance, under some simplifying conditions. In particular, Section 4.3 characterizes the update error in a possibly non-differentiable setting; to our knowledge, this is the first such analysis of its kind for reinforcement learning.\nPreliminaries. We consider \u03a0 and F to be subspaces of an ambient policy space U, which is a vector space equipped with inner product \u27e8\u00b7,\u00b7\u27e9, induced norm \u2016u\u2016 = \u221a\u27e8u, u\u27e9, dual norm \u2016v\u2016\u2217 = sup{\u27e8v, u\u27e9 | \u2016u\u2016 \u2264 1}, and standard scaling & addition: (au + bv)(s) = au(s) + bv(s) for a, b \u2208 R and u, v \u2208 U. The cost functional of a policy u is J(u) = \u222b_S c(s, u(s)) d\u00b5_u(s), where \u00b5_u is the distribution of states induced by u. The joint policy class is H = \u03a0 \u2295 F, defined by H = {\u03c0 + f | \u2200\u03c0 \u2208 \u03a0, f \u2208 F}.2 Note that H is a subspace of U, and inherits its vector space properties. Without affecting the analysis, we simply equate U \u2261 H for the remainder of the paper.\nWe assume that J is convex in H, which implies that subgradient \u2202J(h) exists (with respect to H) [9]. Where J is differentiable, we utilize the notion of a Fr\u00e9chet gradient. Recall that a bounded linear operator \u2207 : H \u21a6 H is called a Fr\u00e9chet functional gradient of J at h \u2208 H if\n\nlim_{\u2016g\u2016\u21920} ( J(h + g) \u2212 J(h) \u2212 \u27e8\u2207J(h), g\u27e9 ) / \u2016g\u2016 = 0.\n\nBy default, \u2207 (or \u2207_H for emphasis) denotes the gradient with respect to H, whereas \u2207_F defines the gradient in the restricted subspace F.\n\n2The operator \u2295 is not a direct sum, since \u03a0 and F are not orthogonal.\n\n4.1 PROPEL as (Approximate) Functional Mirror Descent\nFor our analysis, PROPEL can be viewed as approximating mirror descent in (infinite-dimensional) function space over a convex set \u03a0 \u2282 H.3 Similar to the finite-dimensional setting [40], we choose a strongly convex and smooth functional regularizer R to be the mirror map. From the approximate mirror descent perspective, for each iteration t:\n1. Obtain a noisy gradient estimate: \u2207\u0302_{t\u22121} \u2248 \u2207J(\u03c0_{t\u22121})\n2. UPDATEH(\u03c0) in H space: \u2207R(h_t) = \u2207R(\u03c0_{t\u22121}) \u2212 \u03b7\u2207\u0302_{t\u22121} (Note UPDATEH \u2260 UPDATEF)\n3. Obtain approximate projection: \u03c0_t = PROJECT^R_\u03a0(h_t) \u2248 argmin_{\u03c0 \u2208 \u03a0} D_R(\u03c0, h_t)\nD_R(u, v) = R(u) \u2212 R(v) \u2212 \u27e8\u2207R(v), u \u2212 v\u27e9 is a Bregman divergence. Taking R(h) = (1/2)\u2016h\u2016^2 will recover projected functional gradient descent in L2-space. Here UPDATE becomes h_t = \u03c0_{t\u22121} \u2212 \u03b7\u2207\u0302J(\u03c0_{t\u22121}), and PROJECT solves for argmin_{\u03c0 \u2208 \u03a0} \u2016\u03c0 \u2212 h_t\u2016^2. While we mainly focus on this choice of R in our experiments, note that other selections of R lead to different UPDATE and PROJECT operators (e.g., minimizing KL divergence if R is negative entropy).\nThe functional mirror descent scheme above may encounter two additional sources of error compared to standard mirror descent [40]. First, in the stochastic setting (also called bandit feedback [28]), the gradient estimate \u2207\u0302_t may be biased, in addition to having high variance. One potential source of bias is the gap between UPDATEH and UPDATEF. Second, the PROJECT step may be inexact. We start by analyzing the behavior of PROPEL under generic bias, variance, and projection errors, before discussing the implications of approximating UPDATEH and PROJECT\u03a0 by Algs. 2 & 3, respectively.\nLet the bias be bounded by \u03b2, i.e., \u2016E[\u2207\u0302_t | \u03c0_t] \u2212 \u2207J(\u03c0_t)\u2016\u2217 \u2264 \u03b2 almost surely. Similarly let the variance of the gradient estimate be bounded by \u03c3^2, and the projection error norm \u2016\u03c0_t \u2212 \u03c0\u2217_t\u2016 \u2264 \u03b5. We state the expected regret bound below; more details and a proof appear in Appendix A.2.\nTheorem 4.1 (Expected regret bound under gradient estimation and projection errors). Let \u03c0_1, . . . , \u03c0_T be a sequence of programmatic policies returned by Algorithm 1, and \u03c0\u2217 be the optimal programmatic policy. Choosing learning rate \u03b7 = \u221a( (1/\u03c3^2)(1/T + \u03b5) ), we have the expected regret over T iterations:\n\nE[ (1/T) \u2211_{t=1}^{T} J(\u03c0_t) ] \u2212 J(\u03c0\u2217) = O( \u03c3 \u221a(1/T + \u03b5) + \u03b2 ).    (2)\n\nThe result shows that error \u03b5 from PROJECT and the bias \u03b2 do not accumulate and simply contribute an additive term on the expected regret.4 The effect of variance of gradient estimate decreases at a \u221a(1/T) rate. Note that this regret bound is agnostic to the specific UPDATE and PROJECT operations, and can be applied more generically beyond the specific algorithmic choices used in our paper.\n\n4.2 Finite-Sample Analysis under Vanilla Policy Gradient Update and DAgger Projection\nNext, we show how certain instantiations of UPDATE and PROJECT affect the magnitude of errors and influence end-to-end learning performance from finite samples, under some simplifying assumptions on the UPDATE step. For this analysis, we simplify Alg. 2 into the case UPDATEF \u2261 UPDATEH. In particular, we assume programmatic policies in \u03a0 to be parameterized by a vector \u03b8 \u2208 R^k, and \u03c0 is differentiable in \u03b8 (e.g., we can view \u03a0 \u2282 F where F is parameterized in R^k). We further assume the trajectory roll-out is performed in an exploratory manner, where actions are taken uniformly at random over a finite set of A actions, thus enabling the bound on the bias of gradient estimates via Bernstein\u2019s inequality. The PROJECT step is consistent with Alg. 3, i.e., using DAgger [45] under convex imitation loss, such as \u21132 loss. We have the following high-probability guarantee:\nTheorem 4.2 (Finite-sample guarantee). 
At each iteration, we perform a vanilla policy gradient estimate of \u03c0 (over H) using m trajectories, and use the DAgger algorithm to collect M roll-outs for the imitation learning projection. Setting the learning rate \u03b7 = \u221a( (1/\u03c3^2)(1/T + H/M + \u221a(log(T/\u03b4)/M)) ), after T rounds of the algorithm, we have that:\n\n(1/T) \u2211_{t=1}^{T} J(\u03c0_t) \u2212 J(\u03c0\u2217) \u2264 O( \u03c3 \u221a(1/T + H/M + \u221a(log(T/\u03b4)/M)) ) + O( \u03c3 \u221a(log(Tk/\u03b4)/m) + AH log(Tk/\u03b4)/m )\n\nholds with probability at least 1 \u2212 \u03b4, with H being the task horizon, A the cardinality of action space, \u03c3^2 the variance of policy gradient estimates, and k the dimension of \u03a0\u2019s parameterization.\n\n3\u03a0 can be convexified by considering randomized policies, as stochastic combinations of \u03c0 \u2208 \u03a0 (cf. [35]).\n4Other mirror descent-style analyses, such as in [52], lead to accumulation of errors over the rounds of learning T. One key difference is that we are leveraging the assumption of convexity of J in the (infinite-dimensional) function space representation.\n\nThe expanded result and proof are included in Appendix A.3. The proof leverages previous analysis from DAgger [46] and the finite sample analysis of vanilla policy gradient algorithm [32]. The finite-sample regret bound scales linearly with the standard deviation \u03c3 of the gradient estimate, while the bias, which is the very last component of the RHS, scales linearly with the task horizon H. Note that the standard deviation \u03c3 can be exponential in task horizon H in the worst case [32], and so it is important to have practical implementation strategies to reduce the variance of the UPDATE operation. While conducted in a stylized setting, this analysis provides insight into the relative trade-offs of spending effort in obtaining more accurate projections versus more reliable gradient estimates.\n\n4.3 Closing the gap between UPDATEH and UPDATEF\nOur functional mirror descent analysis rests on taking gradients in H: UPDATEH(\u03c0) involves estimating \u2207_H J(\u03c0) in the H space. On the other hand, Algorithm 2 performs UPDATEF(\u03c0) only in the neural policy space F. In either case, although J(\u03c0) may be differentiable in the non-parametric ambient policy space, it may not be possible to obtain a differentiable parametric programmatic representation in \u03a0. In this section, we discuss theoretical motivations for addressing a practical issue: How do we define and approximate the gradient \u2207_H J(\u03c0) under a parametric representation? To our knowledge, we are the first to consider such a theoretical question for reinforcement learning.\nDefining a consistent approximation of \u2207_H J(\u03c0). The idea in UPDATEF(\u03c0) (Line 8 of Alg. 1) is to approximate \u2207_H J(\u03c0) by \u2207_F J(f), which has a differentiable representation, at some f close to \u03c0 (under the norm). Under appropriate conditions on F, we show that this approximation is valid.\nProposition 4.3. Assume that (i) J is Fr\u00e9chet differentiable on H, (ii) J is also differentiable on the restricted subspace F, and (iii) F is dense in H (i.e., the closure of F equals H). Then for any fixed policy \u03c0 \u2208 \u03a0, define a sequence of policies (f_k \u2208 F, k = 1, 2, . . .) that converges to \u03c0: lim_{k\u2192\u221e} \u2016f_k \u2212 \u03c0\u2016 = 0. 
We then have lim_{k\u2192\u221e} \u2016\u2207_F J(f_k) \u2212 \u2207_H J(\u03c0)\u2016_\u2217 = 0.\nSince the Fr\u00e9chet gradient is unique in the ambient space H, \u2200k we have \u2207_H J(f_k) = \u2207_F J(f_k) \u2192\n\u2207_H J(\u03c0) as k \u2192 \u221e (by Proposition 4.3). We thus have an asymptotically unbiased approximation of\n\u2207_H J(\u03c0) via the differentiable space F as: \u2207_F J(\u03c0) \u225c \u2207_H J(\u03c0) \u225c lim_{k\u2192\u221e} \u2207_F J(f_k) (see footnote 5). Connecting to\nthe result from Theorem 4.1, let \u03c3^2 be an upper bound on the policy gradient estimates in the neural\npolicy class F. Under an asymptotically unbiased approximation of \u2207_H J(\u03c0), the expected regret\nbound becomes\n\nE[(1/T) \u2211_{t=1}^{T} J(\u03c0_t)] \u2212 J(\u03c0^\u2217) = O(\u03c3 \u221a(1/T + \u03b5)).\n\nBias-variance considerations of UPDATE_F(\u03c0). To further theoretically motivate a practical strategy\nfor UPDATE_F(\u03c0) in Algorithm 2, we utilize an equivalent proximal perspective of mirror descent\n[10], where UPDATE_H(\u03c0) is equivalent to solving for h\u2032 = argmin_{h\u2208H} \u03b7\u27e8\u2207_H J(\u03c0), h\u27e9 + D_R(h, \u03c0).\nProposition 4.4 (Minimizing a relaxed objective). For a fixed programmatic policy \u03c0, with sufficiently\nsmall constant \u03bb \u2208 (0, 1), we have that\n\nmin_{h\u2208H} \u03b7\u27e8\u2207_H J(\u03c0), h\u27e9 + D_R(h, \u03c0) \u2264 min_{f\u2208F} J(\u03c0 + \u03bbf) \u2212 J(\u03c0) + \u27e8\u2207J(\u03c0), \u03c0\u27e9    (3)\n\nThus, a relaxed UPDATE_H step is obtained by minimizing the RHS of (3), i.e., minimizing J(\u03c0 + \u03bbf)\nover f \u2208 F. Each gradient descent update step is now f\u2032 = f \u2212 \u03b7\u03bb\u2207_F J(\u03c0_t + \u03bbf), corresponding\nto Line 5 of Algorithm 2. 
For fixed \u03c0 and small \u03bb, this relaxed optimization problem becomes\nregularized policy optimization over F, which is significantly easier. Functional regularization in\npolicy space around a fixed prior controller \u03c0 has demonstrated significant reduction in the variance\nof the gradient estimate [19], at the expense of some bias. The expected regret bound below summarizes\nthe impact of this increased bias and reduced variance, with details included in Appendix A.5.\n\nFootnote 5: We do not assume J(\u03c0) to be differentiable when restricting to the policy subspace \u03a0, i.e., \u2207_\u03a0 J(\u03c0) may\nnot exist under the policy parameterization of \u03a0.\n\nProposition 4.5 (Bias-variance characterization of UPDATE_F). Assuming J(h) is L-strongly smooth\nover H, i.e., \u2207_H J(h) is L-Lipschitz continuous, approximating UPDATE_H by UPDATE_F per Alg. 2\nleads to the expected regret bound:\n\nE[(1/T) \u2211_{t=1}^{T} J(\u03c0_t)] \u2212 J(\u03c0^\u2217) = O(\u03bb\u03c3 \u221a(1/T + \u03b5) + \u03bb^2 L^2).\n\nCompared to the idealized unbiased approximation in Proposition 4.3, the introduced bias here is\nrelated to the inherent smoothness property of the cost functional J(h) over the joint policy class H, i.e.,\nhow close J(\u03c0 + \u03bbf) is to its linear under-approximation J(\u03c0) + \u27e8\u2207_H J(\u03c0), \u03bbf\u27e9 around \u03c0.\n\n5 Experiments\n\nWe demonstrate the effectiveness of PROPEL in synthesizing programmatic controllers in three\ncontinuous control environments. For brevity, this section focuses primarily on TORCS (footnote 6), a\nchallenging race car simulator environment [59]. Empirical results on two additional classic control\ntasks, Mountain-Car and Pendulum, are provided in Appendix B; those results follow similar trends\nas the ones described for TORCS below, and further validate the convergence analysis of PROPEL.\nExperimental Setup. 
We evaluate over \ufb01ve distinct\ntracks in the TORCS simulator. The dif\ufb01culty of a\ntrack can be characterized by three properties; track\nlength, track width, and number of turns. Our suite\nof tracks provides environments with varying levels of\ndif\ufb01culty for the learning algorithm. The performance\nof a policy in the TORCS simulator is measured by the\nlap time achieved on the track. To calculate the lap\ntime, the policies are allowed to complete a three-lap\nrace, and we record the best lap time during this race.\nWe perform the experiments with twenty-\ufb01ve random\nseeds and report the median lap time over these twenty-\n\ufb01ve trials. Some of the policies crash the car before\ncompleting a lap on certain tracks, even after training\nfor 600 episodes. Such crashes are recorded as a lap\ntime of in\ufb01nity while calculating the median. If the policy crashes for more than half the seeds, this\nis reported as CR in Tables 1 & 2. We choose to report the median because taking the crash timing as\nin\ufb01nity, or an arbitrarily large constant, heavily skews other common measures such as the mean.\nBaselines. Among recent state-of-the-art approaches\nto learning programmatic policies are NDPS [58] for\nhigh-level language policies, and VIPER [8] for learn-\ning tree-based policies. Both NDPS and VIPER rely\non imitating a \ufb01xed (pre-trained) neural policy oracle,\nand can be viewed as degenerate versions of PROPEL\nthat only run Lines 4-6 in Algorithm 1. We present\ntwo PROPEL analogues to NDPS and VIPER: (i) PRO-\nPELPROG: PROPEL using the high-level language of\nFigure 1 as the class of programmatic policies, similar\nto NDPS. (ii) PROPELTREE: PROPEL using regres-\nsion trees, similar to VIPER. We also report results for\nPRIOR, which is a (sub-optimal) PID controller that is\nalso used as the initial policy in PROPEL. 
In addition, to\nstudy generalization ability as well as safety behavior\nduring training, we also include DDPG, a neural policy learned using the Deep Deterministic Policy\nGradients [36] algorithm, with 600 episodes of training for each track. In principle, PROPEL and its\nanalysis can accommodate different policy gradient subroutines. However, in the TORCS domain,\nother policy gradient algorithms such as PPO and TRPO failed to learn policies that are able to\ncomplete the considered tracks. We thus focus on DDPG as our main policy gradient component.\n\nFigure 4: Median lap-time improvements during multiple iterations of PROPELPROG over 25 random seeds. (Axes: Iterations vs. Lap Time Improvement; one curve per track: G-Track, E-Road, Aalborg, Ruudskogen, Alpine-2.)\n\nFigure 5: Median number of crashes during training of DDPG and PROPELPROG over 25 random seeds. (Axes: Track ID vs. Number of Crashes; series: Max Episodes, DDPG, PROPEL-Prog.)\n\nFootnote 6: The code for the TORCS experiments can be found at: https://bitbucket.org/averma8053/propel\n\nTable 1: Performance results in TORCS over 25 random seeds. Each entry is formatted as Lap-time /\nCrash-ratio, reporting median lap time in seconds over all the seeds (lower is better) and ratio of\nseeds that result in crashes (lower is better). A lap time of CR indicates the agent crashed and could\nnot complete a lap for more than half the seeds.\n\n| | G-TRACK | E-ROAD | AALBORG | RUUDSKOGEN | ALPINE-2 |\n| LENGTH | 3186 m | 3260 m | 2588 m | 3274 m | 3774 m |\n| PRIOR | 312.92 / 0.0 | 322.59 / 0.0 | 244.19 / 0.0 | 340.29 / 0.0 | 402.89 / 0.0 |\n| DDPG | 78.82 / 0.24 | 89.71 / 0.28 | 101.06 / 0.40 | CR / 0.68 | CR / 0.92 |\n| NDPS | 108.25 / 0.24 | 126.80 / 0.28 | 163.25 / 0.40 | CR / 0.68 | CR / 0.92 |\n| VIPER | 83.60 / 0.24 | 87.53 / 0.28 | 110.57 / 0.40 | CR / 0.68 | CR / 0.92 |\n| PROPELPROG | 93.67 / 0.04 | 119.17 / 0.04 | 147.28 / 0.12 | 124.58 / 0.16 | 256.59 / 0.16 |\n| PROPELTREE | 78.33 / 0.04 | 79.39 / 0.04 | 109.83 / 0.16 | 118.80 / 0.24 | 236.01 / 0.36 |\n\nTable 2: Generalization results in TORCS, where rows are training and columns are testing tracks.\nEach entry is formatted as PROPELPROG / DDPG, and the number reported is the median lap time in\nseconds over all the seeds (lower is better). CR indicates the agent crashed and could not complete a\nlap for more than half the seeds.\n\n| Training track | G-TRACK | E-ROAD | AALBORG | RUUDSKOGEN | ALPINE-2 |\n| G-TRACK | - | 124 / CR | CR / CR | CR / CR | CR / CR |\n| E-ROAD | 102 / 92 | - | CR / CR | CR / CR | CR / CR |\n| AALBORG | 201 / 91 | 228 / CR | - | 217 / CR | CR / CR |\n| RUUDSKOGEN | 131 / CR | 135 / CR | CR / CR | - | CR / CR |\n| ALPINE-2 | 222 / CR | 231 / CR | 184 / CR | CR / CR | - |\n\nEvaluating Performance. Table 1 shows the performance on the considered TORCS tracks. We\nsee that PROPELPROG and PROPELTREE consistently outperform the NDPS [58] and VIPER [8]\nbaselines, respectively. While DDPG outperforms PROPEL on some tracks, its volatility causes it\nto be unable to learn in some environments, and hence to crash the majority of the time. Figure 4\nshows the consistent improvements made over the prior by PROPELPROG, over the iterations of the\nPROPEL algorithm. Appendix B contains similar results achieved on the two classic control tasks,\nMountainCar and Pendulum. 
Figure 5 shows that, compared to DDPG, our approach suffers far fewer\ncrashes while training in TORCS.\nEvaluating Generalization. To compare the ability of the controllers to perform on tracks not seen\nduring training, we executed the learned policies on all the other tracks (Table 2). We observe that\nDDPG crashes signi\ufb01cantly more often than PROPELPROG. This demonstrates the generalizability\nof the policies returned by PROPEL. Generalization results for the PROPELTREE policy are given\nin the appendix. In general, PROPELTREE policies are more generalizable than DDPG but less than\nPROPELPROG. On an absolute level, the generalization ability of PROPEL still leaves much room for\nimprovement, which is an interesting direction for future work.\nVeri\ufb01ability of Policies. As shown in prior work [8, 58], parsimonious programmatic policies are\nmore amenable to formal veri\ufb01cation than neural policies. Unsurprisingly, the policies generated by\nPROPELTREE and PROPELPROG are easier to verify than DDPG policies. As a concrete example,\nwe veri\ufb01ed a smoothness property of the PROPELPROG policy using the Z3 SMT-solver [21] (more\ndetails in Appendix B). The veri\ufb01cation terminated in 0.49 seconds.\nInitialization. In principle, PROPEL can be initialized with a random program, or a random policy\ntrained using DDPG. In practice, the performance of PROPEL depends to a certain degree on the\nstability of the policy gradient procedure, which is DDPG in our experiments. Unfortunately, DDPG\noften exhibits high variance across trials and fares poorly in challenging RL domains. Speci\ufb01cally, in\nour TORCS experiments, DDPG fails on a number of tracks (similar phenomena have been reported in\nprevious work that experiments on similar continuous control domains [30, 19, 58]). Agents obtained\nby initializing PROPEL with neural policies obtained via DDPG also fail on multiple tracks. Their\nperformance over the \ufb01ve tracks is reported in Appendix B. 
In contrast, PROPEL can often finish the\nchallenging tracks when initialized with a very simple hand-crafted programmatic prior.\n\n6 Related Work\n\nProgram Synthesis. Program synthesis is the problem of automatically searching for a program\nwithin a language that fits a given specification [29]. Recent approaches to the problem have leveraged\nsymbolic knowledge about program structure [27], satisfiability solvers [50, 31], and meta-learning\ntechniques [39, 41, 22, 7] to generate interesting programs in many domains [3, 42, 4]. In most prior\nwork, the specification is a logical constraint on the input/output behavior of the target program.\nHowever, there is also a growing body of work that considers program synthesis modulo optimality\nobjectives [13, 15, 43], often motivated by machine learning tasks [39, 57, 26, 23, 58, 8, 60].\nSynthesis of programs that imitate an oracle has been considered in both the logical [31] and the\noptimization [58, 8, 60] settings. The projection step in PROPEL builds on this prior work. While our\ncurrent implementation of this step is entirely symbolic, in principle, the operation can also utilize\ncontemporary techniques for learning policies that guide the synthesis process [39, 7, 49].\nConstrained Policy Learning. Constrained policy learning has seen increased interest in recent\nyears, largely due to the desire to impose side guarantees such as stability and safety on the policy\u2019s\nbehavior. Broadly, there are two approaches to imposing constraints: specifying constraints as an\nadditional cost function [1, 35], and explicitly encoding constraints into the policy class [2, 34, 19,\n20, 12]. In some cases, these two approaches can be viewed as duals of each other. 
For instance,\nrecent work that uses control-theoretic policies as a functional regularizer [34, 19] can be viewed\nfrom the perspective of both regularization (additional cost) and an explicitly constrained policy class\n(a specific mix of neural and control-theoretic policies). We build upon this perspective to develop\nthe gradient update step in our approach.\nRL using Imitation Learning. There are two ways to utilize imitation learning subroutines within\nRL. First, one can leverage limited-access or sub-optimal experts to speed up learning [44, 18, 14, 51].\nSecond, one can learn over two policy classes (or one policy and one model class) to achieve\naccelerated learning compared to using only one policy class [38, 17, 52, 16]. Our approach has\nsome stylistic similarities to previous efforts [38, 52] that use a richer policy space to search for\nimprovements before re-training the primary policy to imitate the richer policy. One key difference is\nthat our primary policy is programmatic and potentially non-differentiable. A second key difference\nis that our theoretical framework takes a functional gradient descent perspective; it would be\ninteresting to carefully compare with previous analysis techniques to find a unifying framework.\nRL with Mirror Descent. The mirror descent framework has previously been used to analyze and\ndesign RL algorithms. For example, Thomas et al. [56] and Mahadevan and Liu [37] use composite\nobjective mirror descent, or COMID [25], which allows incorporating adaptive regularizers into\ngradient updates, thus offering connections to either natural gradient RL [56] or sparsity inducing RL\nalgorithms [37]. Unlike in our work, these prior approaches perform projection into the same native,\ndifferentiable representation. Also, the analyses in these papers do not consider errors introduced by\nhybrid representations and approximate projection operators. 
However, one can potentially extend\nour approach with versions of mirror descent, e.g., COMID, that were considered in these efforts.\n\n7 Conclusion and Future Work\n\nWe have presented PROPEL, a meta-algorithm based on mirror descent, program synthesis, and\nimitation learning, for programmatic reinforcement learning (PRL). We have presented theoretical\nconvergence results for PROPEL, developing novel analyses to characterize approximate projections\nand biased gradients within the mirror descent framework. We also validated PROPEL empirically, and\nshowed that it can discover interpretable, verifiable, generalizable, performant policies and significantly\noutperform the state of the art in PRL.\nThe central idea of PROPEL is the use of imitation learning and combinatorial methods in implementing\na projection operation for mirror descent, with the goal of optimization in a functional space that\nlacks gradients. While we have developed PROPEL in an RL setting, this idea is not restricted to RL\nor even sequential decision making. Future work will seek to exploit this insight in other machine\nlearning and program synthesis settings.\nAcknowledgements. This work was supported in part by United States Air Force Contract # FA8750-19-C-0092,\nNSF Award # 1645832, NSF Award # CCF-1704883, the Okawa Foundation, Raytheon, PIMCO, and Intel.\n\nReferences\n[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In\nProceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22\u201331. JMLR.org, 2017.\n\n[2] Mohammed Alshiekh, Roderick Bloem, R\u00fcdiger Ehlers, Bettina K\u00f6nighofer, Scott Niekum, and Ufuk\nTopcu. Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial\nIntelligence, 2018.\n\n[3] Rajeev Alur, Rastislav Bod\u00edk, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-\nGazit, P. Madhusudan, Milo M. K. 
Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia,\nRishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In\nDependable Software Systems Engineering, pages 1\u201325. 2015.\n\n[4] Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. Scaling enumerative program synthesis via divide\nand conquer. In Tools and Algorithms for the Construction and Analysis of Systems - 23rd International\nConference, TACAS 2017, Held as Part of the European Joint Conferences on Theory and Practice of\nSoftware, ETAPS 2017, Uppsala, Sweden, April 22-29, 2017, Proceedings, Part I, pages 319\u2013336, 2017.\n[5] Kiam Heong Ang, Gregory Chong, and Yun Li. Pid control system analysis, design, and technology. IEEE\n\ntransactions on control systems technology, 13(4):559\u2013576, 2005.\n\n[6] Karl Johan \u00c5str\u00f6m and Tore H\u00e4gglund. Automatic tuning of simple regulators with speci\ufb01cations on phase\n\nand amplitude margins. Automatica, 20(5):645\u2013651, 1984.\n\n[7] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder:\nLearning to write programs. In 5th International Conference on Learning Representations, ICLR 2017,\nToulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.\n\n[8] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Veri\ufb01able reinforcement learning via policy\n\nextraction. In Advances in Neural Information Processing Systems, pages 2494\u20132504, 2018.\n\n[9] Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert\n\nspaces, volume 408. Springer, 2011.\n\n[10] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex\n\noptimization. Operations Research Letters, 31(3):167\u2013175, 2003.\n\n[11] Richard Bellman, Irving Glicksberg, and Oliver Gross. On the \u201cbang-bang\u201d control problem. 
Quarterly of\n\nApplied Mathematics, 14(1):11\u201318, 1956.\n\n[12] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforce-\nment learning with stability guarantees. In Advances in neural information processing systems, pages\n908\u2013918, 2017.\n\n[13] Roderick Bloem, Krishnendu Chatterjee, Thomas A. Henzinger, and Barbara Jobstmann. Better quality in\nsynthesis through quantitative objectives. In Computer Aided Veri\ufb01cation, 21st International Conference,\nCAV 2009, Grenoble, France, June 26 - July 2, 2009. Proceedings, pages 140\u2013156, 2009.\n\n[14] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daum\u00e9 III, and John Langford. Learning to\n\nsearch better than your teacher. In International Conference on Machine Learning (ICML), 2015.\n\n[15] Swarat Chaudhuri, Martin Clochard, and Armando Solar-Lezama. Bridging boolean and quantitative\n\nsynthesis using smoothed proof search. In POPL, pages 207\u2013220, 2014.\n\n[16] Ching-An Cheng, Xinyan Yan, Nathan Ratliff, and Byron Boots. Predictor-corrector policy optimization.\n\nIn International Conference on Machine Learning (ICML), 2019.\n\n[17] Ching-An Cheng, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Accelerating imitation learning\nwith predictive models. In International Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS),\n2019.\n\n[18] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation\n\nand reinforcement. In Uncertainty in arti\ufb01cial intelligence, 2019.\n\n[19] Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control\nregularization for reduced variance reinforcement learning. In International Conference on Machine\nLearning (ICML), 2019.\n\n[20] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa.\n\nSafe exploration in continuous action spaces. 
arXiv preprint arXiv:1801.08757, 2018.\n\n[21] Leonardo Mendon\u00e7a de Moura and Nikolaj Bj\u00f8rner. Z3: An Efficient SMT Solver. In TACAS, pages\n337\u2013340, 2008.\n\n[22] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet\nKohli. Robustfill: Neural program learning under noisy i/o. In Proceedings of the 34th International\nConference on Machine Learning-Volume 70, pages 990\u2013998. JMLR.org, 2017.\n\n[23] Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando\nSolar-Lezama, and Wojciech Matusik. Inversecsg: automatic conversion of 3d models to CSG trees. ACM\nTrans. Graph., 37(6):213:1\u2013213:16, 2018.\n\n[24] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement\nlearning for continuous control. In International Conference on Machine Learning, pages 1329\u20131338,\n2016.\n\n[25] John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent.\nIn COLT, pages 14\u201326, 2010.\n\n[26] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics\nprograms from hand-drawn images. In Advances in Neural Information Processing Systems, pages\n6059\u20136068, 2018.\n\n[27] John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-\noutput examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language\nDesign and Implementation, Portland, OR, USA, June 15-17, 2015, pages 229\u2013239, 2015.\n\n[28] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the\nbandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM\nsymposium on Discrete algorithms, pages 385\u2013394. Society for Industrial and Applied Mathematics, 2005.\n\n[29] Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 
Program synthesis. Foundations and Trends in\n\nProgramming Languages, 4(1-2):1\u2013119, 2017.\n\n[30] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep\nreinforcement learning that matters. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence, 2018.\n[31] Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. Oracle-guided component-based program\nsynthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-\nVolume 1, pages 215\u2013224. ACM, 2010.\n\n[32] Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis,\n\nUniversity of London London, England, 2003.\n\n[33] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing\n\nsystems, pages 1008\u20131014, 2000.\n\n[34] Hoang M. Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence\n\nprediction. In International Conference on Machine Learning (ICML), 2016.\n\n[35] Hoang M Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International\n\nConference on Machine Learning (ICML), 2019.\n\n[36] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David\nSilver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint\narXiv:1509.02971, 2015.\n\n[37] Sridhar Mahadevan and Bo Liu. Sparse q-learning with mirror descent. In Proceedings of the Twenty-Eighth\n\nConference on Uncertainty in Arti\ufb01cial Intelligence, pages 564\u2013573. AUAI Press, 2012.\n\n[38] William H Montgomery and Sergey Levine. Guided policy search via approximate mirror descent. In\n\nAdvances in Neural Information Processing Systems, pages 4008\u20134016, 2016.\n\n[39] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for conditional\n\nprogram generation. 
In ICLR, 2018.\n\n[40] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency\nin optimization. 1983.\n\n[41] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet\nKohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.\n\n[42] Oleksandr Polozov and Sumit Gulwani. Flashmeta: a framework for inductive program synthesis. In\nProceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming,\nSystems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA,\nOctober 25-30, 2015, pages 107\u2013126, 2015.\n\n[43] Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. Learning programs from noisy data.\nIn Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming\nLanguages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, pages 761\u2013774, 2016.\n\n[44] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret\nlearning. arXiv preprint arXiv:1406.5979, 2014.\n\n[45] St\u00e9phane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured\nprediction to no-regret online learning. In Proceedings of the fourteenth international conference on\nartificial intelligence and statistics, pages 627\u2013635, 2011.\n\n[46] St\u00e9phane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured\nprediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on\nArtificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages\n627\u2013635, 2011.\n\n[47] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy\noptimization. 
In International Conference on Machine Learning, pages 1889\u20131897, 2015.\n\n[48] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy\noptimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[49] Xujie Si, Yuan Yang, Hanjun Dai, Mayur Naik, and Le Song. Learning a meta-solver for syntax-guided\nprogram synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans,\nLA, USA, May 6-9, 2019, 2019.\n\n[50] Armando Solar-Lezama, Liviu Tancau, Rastislav Bod\u00edk, Sanjit A. Seshia, and Vijay A. Saraswat. Combinatorial\nsketching for finite programs. In ASPLOS, pages 404\u2013415, 2006.\n\n[51] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement\nlearning & imitation learning. In International Conference on Learning Representations (ICLR), 2018.\n\n[52] Wen Sun, Geoffrey J Gordon, Byron Boots, and J Bagnell. Dual policy iteration. In Advances in Neural\nInformation Processing Systems, pages 7059\u20137069, 2018.\n\n[53] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated:\nDifferentiable imitation learning for sequential prediction. In International Conference on Machine\nLearning (ICML), 2017.\n\n[54] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.\n\n[55] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods\nfor reinforcement learning with function approximation. In Advances in neural information processing\nsystems, pages 1057\u20131063, 2000.\n\n[56] Philip S Thomas, William C Dabney, Stephen Giguere, and Sridhar Mahadevan. Projected natural\nactor-critic. In Advances in neural information processing systems, pages 2337\u20132345, 2013.\n\n[57] Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, and Swarat Chaudhuri. 
Houdini:\nLifelong learning as program synthesis. In Advances in Neural Information Processing Systems, pages\n8687\u20138698, 2018.\n\n[58] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically\ninterpretable reinforcement learning. In International Conference on Machine Learning,\npages 5052\u20135061, 2018.\n\n[59] Bernhard Wymann, Eric Espi\u00e9, Christophe Guionneau, Christos Dimitrakakis, R\u00e9mi Coulom, and Andrew\nSumner. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.\n\n[60] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework\nfor verifiable reinforcement learning. In ACM Conference on Programming Language Design and\nImplementation (SIGPLAN), 2019.\n", "award": [], "sourceid": 9215, "authors": [{"given_name": "Abhinav", "family_name": "Verma", "institution": "Rice University"}, {"given_name": "Hoang", "family_name": "Le", "institution": "California Institute of Technology"}, {"given_name": "Yisong", "family_name": "Yue", "institution": "Caltech"}, {"given_name": "Swarat", "family_name": "Chaudhuri", "institution": "Rice University"}]}