{"title": "Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition", "book": "Advances in Neural Information Processing Systems", "page_first": 8538, "page_last": 8547, "abstract": "The design of a reward function often poses a major practical challenge to real-world applications of reinforcement learning. Approaches such as inverse reinforcement learning attempt to overcome this challenge, but require expert demonstrations, which can be difficult or expensive to obtain in practice. We propose variational inverse control with events (VICE), which generalizes inverse reinforcement learning methods to cases where full demonstrations are not needed, such as when only samples of desired goal states are available. Our method is grounded in an alternative perspective on control and reinforcement learning, where an agent's goal is to maximize the probability that one or more events will happen at some point in the future, rather than maximizing cumulative rewards. We demonstrate the effectiveness of our methods on continuous control tasks, with a focus on high-dimensional observations like images where rewards are hard or even impossible to specify.", "full_text": "Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition

Justin Fu* Avi Singh* Dibya Ghosh Larry Yang Sergey Levine
{justinfu, avisingh, dibyaghosh, larrywyang, svlevine}@berkeley.edu
University of California, Berkeley

Abstract

The design of a reward function often poses a major practical challenge to real-world applications of reinforcement learning. Approaches such as inverse reinforcement learning attempt to overcome this challenge, but require expert demonstrations, which can be difficult or expensive to obtain in practice. 
We propose variational inverse control with events (VICE), which generalizes inverse reinforcement learning methods to cases where full demonstrations are not needed, such as when only samples of desired goal states are available. Our method is grounded in an alternative perspective on control and reinforcement learning, where an agent's goal is to maximize the probability that one or more events will happen at some point in the future, rather than maximizing cumulative rewards. We demonstrate the effectiveness of our methods on continuous control tasks, with a focus on high-dimensional observations like images where rewards are hard or even impossible to specify.

1 Introduction

Reinforcement learning (RL) has shown remarkable promise in recent years, with results on a range of complex tasks such as robotic control (Levine et al., 2016) and playing video games (Mnih et al., 2015) from raw sensory input. RL algorithms solve these problems by learning a policy that maximizes a reward function, which is treated as part of the problem formulation. The theory of RL offers little practical guidance on how these rewards should be designed. In practice, however, the design of the reward function is critical for good results, and reward misspecification can easily cause unintended behavior (Amodei et al., 2016). For example, a vacuum cleaner robot rewarded for picking up dirt could exploit the reward by repeatedly dumping dirt on the ground and picking it up again (Russell & Norvig, 2003). Additionally, it is often difficult to write down a reward function at all. 
For example, when learning policies from high-dimensional visual observations, practitioners often resort to using motion capture (Peng et al., 2017) or specialized computer vision systems (Rusu et al., 2017) to obtain rewards.

*equal contribution

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Standard IRL requires full expert demonstrations and aims to produce an agent that mimics the expert. VICE generalizes IRL to cases where we only observe final desired outcomes, which does not require the expert to actually know how to perform the task.

As an alternative to reward specification, imitation learning (Argall et al., 2009) and inverse reinforcement learning (Ng & Russell, 2000) instead seek to mimic expert behavior. However, such approaches require an expert to show how to solve a task. We instead propose a novel problem formulation, variational inverse control with events (VICE), which generalizes inverse reinforcement learning to alternative forms of expert supervision. In particular, we consider cases where we have examples of a desired final outcome, rather than full demonstrations, so the expert only needs to show what the desired outcome of a task is (see Figure 1). A straightforward way to make use of these desired outcomes is to train a classifier (Pinto & Gupta, 2016; Tung et al., 2018) to distinguish desired and undesired states. However, for these approaches it is unclear how to correctly sample negatives, and whether using such a classifier as a reward will result in the intended behavior, since an RL agent can learn to exploit the classifier in the same way it can exploit human-designed rewards. Our framework provides a more principled approach, where classifier training corresponds to learning probabilistic graphical model parameters (see Figure 2), and policy optimization corresponds to inferring the optimal actions. 
By selecting an inference query that corresponds to our intentions, we can mitigate reward hacking scenarios similar to those previously described, and also specify the task with examples rather than manual engineering.

Our inverse formulation is based on a corresponding forward control framework which reframes control as inference in a graphical model. Our framework resembles prior work (Kappen et al., 2009; Toussaint, 2009; Rawlik et al., 2012), but we extend this connection by replacing the conventional notion of rewards with event occurrence variables. Rewards correspond to log-probabilities of events, and value functions can be interpreted as backward messages that represent log-probabilities of those events occurring. This framework retains the full expressivity of RL, since any rewards can be expressed as log-probabilities, while providing more intuitive guidance on task specification. It further allows us to express various intentions, such as for an event to happen at least once, exactly once at any time step, or once at a specific time step. Crucially, our framework does not require the agent to observe the event happening, but only to know the probability that it occurred. While this may seem unusual, it is more practical in the real world, where success may be determined by probabilistic models that themselves carry uncertainty. For example, the previously mentioned vacuum cleaner robot needs to estimate from its observations whether its task has been accomplished, and would never receive direct feedback from the real world about whether a room is clean.

Our contributions are as follows. We first introduce the event-based control framework by extending previous control-as-inference work to alternative queries which we believe to be useful in practice. 
This view on control can ease the process of reward engineering by mapping a user's intention to a corresponding inference query in a probabilistic graphical model. Our experiments demonstrate how different queries can result in different behaviors which align with the corresponding intentions. We then propose methods to learn event probabilities from data, in a manner analogous to inverse reinforcement learning. This corresponds to the use case where designing event probabilities by hand is difficult, but observations (e.g., images) of successful task completion are easier to provide. This approach is substantially easier to apply in practical situations, since full demonstrations are not required. Our experiments demonstrate that our framework can be used in this fashion for policy learning from high-dimensional visual observations where rewards are hard to specify. Moreover, our method substantially outperforms baselines such as sparse reward RL, indicating that our framework provides an automated shaping effect when learning events, making it feasible to solve otherwise hard tasks.

Figure 2: Our framework learns event probabilities from data. We use neural networks as function approximators to model this distribution, which allows us to work with high-dimensional observations like images.

2 Related work

Our reformulation of RL is based on the connection between control and inference (Kappen et al., 2009; Ziebart, 2010; Rawlik et al., 2012). The resulting problem is sometimes referred to as maximum entropy reinforcement learning, or KL control. Duality between control and inference in the case of linear dynamical systems has been studied in Kalman (1960) and Todorov (2008). Maximum entropy objectives can be optimized efficiently and exactly in linearly solvable MDPs (Todorov, 2007) and environments with discrete states. 
In linear-quadratic systems, control-as-inference techniques have been applied to solve path planning problems for robotics (Toussaint, 2009). In the context of deep RL, maximum entropy objectives have been used to derive soft variants of Q-learning and policy gradient algorithms (Haarnoja et al., 2017; Schulman et al., 2017; O'Donoghue et al., 2016; Nachum et al., 2017). These methods embed the standard RL objective, formulated in terms of rewards, into the framework of probabilistic inference. In contrast, we aim specifically to reformulate RL in a way that does not require specifying arbitrary scalar-valued reward functions.

In addition to studying inference problems in a control setting, we also study the problem of learning event probabilities in these models. This is related to prior work on inverse reinforcement learning (IRL), which has also sought to cast learning of objectives into the framework of probabilistic models (Ziebart et al., 2008; Ziebart, 2010). As explained in Section 5, our work generalizes IRL to cases where we only provide examples of a desired outcome or goal, which is significantly easier to provide in practice since we do not need to know how to achieve the goal.

Reward design is crucial for obtaining the desired behavior from RL agents (Amodei et al., 2016). Ng & Russell (2000) showed that rewards can be modified, or shaped, to speed up learning without changing the optimal policy. Singh et al. (2010) study the problem of optimal reward design, and introduce the concept of a fitness function. They observe that a proxy reward that is distinct from the fitness function might be optimal under certain settings, and Sorg et al. (2010) study the problem of how this optimal proxy reward can be selected. Hadfield-Menell et al. (2017) introduce the problem of inferring the true objective based on the given reward and MDP. 
Our framework aids task specification by introducing two decisions: the selection of the inference query that is of interest (i.e., when and how many times should the agent cause the event?), and the specification of the event of interest. Moreover, as discussed in Section 6, we observe that our method automatically provides a reward shaping effect, allowing us to solve otherwise hard tasks.

3 Preliminaries

In this section we introduce our notation and summarize how control can be framed as inference. Reinforcement learning operates on Markov decision processes (MDPs), defined by the tuple (S, A, T, r, γ, ρ0). S and A are the state and action spaces, respectively, r is a reward function, which is typically taken to be a scalar field on S × A, and γ ∈ (0, 1) is the discount factor. T and ρ0 represent the dynamics and initial state distributions, respectively.

3.1 Control as inference

In order to cast control as an inference problem, we begin with the standard graphical model for an MDP, which consists of states and actions. We incorporate the notion of a goal with an additional variable et that depends on the state (and possibly also the action) at time step t, according to p(et|st, at). If the goal is specified with a reward function, we can define p(et = 1|st, at) = e^{r(st, at)}, which, as we discuss below, leads to a maximum entropy version of the standard RL framework. This requires the rewards to be non-positive, which is not restrictive in practice, since if the rewards are bounded we can re-center them so that the maximum value is 0. The structure of this model is presented in Figure 3, and is also considered in prior work, as discussed in the previous section.

The maximum entropy reinforcement learning objective emerges when we condition on e1:T = 1. Consider computing a backward message β(st, at) = p(et:T = 1|st, at). 
Letting Q(st, at) = log β(st, at), notice that the backward messages encode the backup equations

Q(st, at) = r(st, at) + log E_{st+1}[e^{V(st+1)}],    V(st) = log ∫_{a∈A} e^{Q(st, a)} da.

Figure 3: A graphical model framework for control. In maximum entropy reinforcement learning, we observe e1:T = 1 and can perform inference on the trajectory to obtain a policy.

We include the full derivation in Appendix A, which resembles derivations discussed in prior work (Ziebart et al., 2008). This backup equation corresponds to maximum entropy RL, and is equivalent to soft Q-learning and causal entropy RL formulations in the special case of deterministic dynamics (Haarnoja et al., 2017; Schulman et al., 2017). For the case of stochastic dynamics, maximum entropy RL is optimistic with respect to the dynamics and produces risk-seeking behavior, and we refer the reader to Appendix B, which covers a variational derivation of the policy objective that properly handles stochastic dynamics.

4 Event-based control

In control as inference, we chose log p(et = 1|st, at) = r(st, at) so that the resulting inference problem matches the maximum entropy reinforcement learning objective. However, we might also ask: what does the variable et, and its probability, represent? The connection to graphical models lets us interpret rewards as the log-probability that an event occurs, and the standard approach to reward design can also be viewed as specifying the probability of some binary event, that we might call an optimality event. This provides us with an alternative way to think about task specification: rather than using arbitrary scalar fields as rewards, we can specify the events for which we would like to maximize the probability of occurrence.

We now outline inference procedures for different types of problems of interest in the graphical model depicted in Figure 3. 
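These backup equations can be checked numerically on a small example. Below is a minimal sketch (our own toy construction, not code from the paper): a finite-horizon tabular MDP with discrete actions, so that the integral over actions becomes a log-sum-exp.

```python
import numpy as np

def soft_backup(r, P, T):
    """Backward messages for the ALL query (maximum entropy RL).

    r: [S, A] array of rewards, i.e. log p(e_t = 1 | s, a).
    P: [S, A, S] transition probabilities.
    T: horizon. Returns a list of [S, A] arrays, where Qs[t][s, a]
    equals log p(e_{t:T} = 1 | s_t = s, a_t = a) under a uniform action prior.
    """
    S, A = r.shape
    Qs = [None] * T
    V = np.zeros(S)  # beyond the horizon there are no events left to satisfy
    for t in reversed(range(T)):
        # Q(s_t, a_t) = r(s_t, a_t) + log E_{s_{t+1}}[exp(V(s_{t+1}))]
        Q = r + np.log(P @ np.exp(V))
        # V(s_t) = log sum_a exp(Q(s_t, a)): discrete analogue of the integral
        V = np.log(np.exp(Q).sum(axis=1))
        Qs[t] = Q
    return Qs
```

With a single action per state and deterministic dynamics, the messages reduce to summed rewards along the trajectory, which gives a quick sanity check.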
In Section 5, we will discuss learning procedures in this graphical model which allow us to specify objectives from data. The strength of the events framework for task specification lies in both its intuitive interpretation and flexibility: though we can obtain similar behavior in standard reinforcement learning, it may require considerable reward tuning and changes to the overall problem statement, including the dynamics. In contrast, events provide a single unified framework where the problem parameters remain unchanged, and we simply ask the appropriate queries. We will discuss:

• ALL query: p(τ|e1:T = 1), meaning the event should happen at each time step.
• AT query: p(τ|et* = 1), meaning the event should happen at a specific time t*.
• ANY query: p(τ|e1 = 1 or e2 = 1 or ... or eT = 1), meaning the event should happen on at least one time step during each trial.

We present two derivations for each query: a conceptually simple one based on maximum entropy and message passing (see Section 3.1), and one based on variational inference (see Appendix B), which is more appropriate for stochastic dynamics. The resulting variational objective is of the form:

J(π) = −D_KL(π(τ) || p(τ|evidence)) = E_{s1:T, a1:T ∼ π}[ Q̂(s1:T, a1:T) + H^π(·|s1:T) ],

where Q̂ is an empirical Q-value estimator for a trajectory and H^π(·|s1:T) = −Σ_{t=0}^{T} log π(at|st) represents the entropy of the policy. This form of the objective can be used in policy gradient algorithms, and in special cases can also be written as a recursive backup equation for dynamic programming algorithms. 
We directly present our results here, and present more detailed derivations (including extensions to discounted cases) in Appendices C and D.

4.1 ALL and AT queries

We begin by reviewing the ALL query, when we wish for an agent to trigger an event at every time step. This can be useful, for example, when expressing some continuous task, such as maintaining some sort of configuration (such as balancing on a unicycle) or avoiding an adverse outcome, such as not causing an autonomous car to collide. As covered in Section 3.1, conditioning on the event at all time steps mathematically corresponds to the same problem as entropy-maximizing RL, with the reward given by log p(et = 1|st, at).

Theorem 4.1 (ALL query). In the ALL query, the message passing update for the Q-value can be written as:

Q(st, at) = log p(et = 1|st, at) + log E_{st+1}[e^{V(st+1)}],

where Q(st, at) represents the log-message log p(et:T = 1|st, at). The corresponding empirical Q-value can be written recursively as:

Q̂(st:T, at:T) = log p(et = 1|st, at) + Q̂(st+1:T, at+1:T).

Proof. See Appendices C.1 and D.1.

The AT query, or querying for the event at a specific time step, results in the same equations, except log p(et = 1|st, at) is only given at the specified time t*. While we generally believe that the ANY query presented in the following section will be more broadly applicable, there may be scenarios where an agent needs to be in a particular configuration or location at the end of an episode. In these cases, the AT query would be the most appropriate.

4.2 ANY query

The ANY query specifies that an event should happen at least once before the end of an episode, without regard for when in particular it takes place. Unlike the ALL and AT queries, the ANY query does not correspond to entropy-maximizing RL and requires a new backup equation. 
It is also in many cases more appropriate: if we would like an agent to accomplish some goal, we might not care when in particular that goal is accomplished, and we likely do not need it to accomplish it more than once. This query can be useful for specifying behaviors such as reaching a goal state, completion of a task, etc. Let the stopping time t* = min{t ≥ 0 | et = 1} denote the first time that the event occurs.

Theorem 4.2 (ANY query). In the ANY query, the message passing update for the Q-value can be written as:

Q(st, at) = log( p(et = 1|st, at) + p(et = 0|st, at) E_{st+1}[e^{V(st+1)}] ),

where Q(st, at) represents the log-message log p(t ≤ t* ≤ T | st, at). The corresponding empirical Q-value can be written recursively as:

Q̂(st:T, at:T) = log( p(et = 1|st, at) + p(et = 0|st, at) e^{Q̂(st+1:T, at+1:T)} ).

Proof. See Appendices C.2 and D.2.

This query is related to first-exit RL problems, in which an agent receives a reward of 1 when a specified goal is reached and is immediately moved to an absorbing state. Our query, however, does not require the event to actually be observed, which makes it applicable to a variety of real-world situations that have uncertainty over the goal. The backup equations of the ANY query are equivalent to the first-exit problem when p(e|s, a) is deterministic. This can be seen by setting p(e = 1|s, a) = rF(s, a), where rF(s, a) is a goal indicator function that denotes the reward of the first-exit problem. In this case, we have Q(s, a) = 0 if the goal is reachable, and Q(s, a) = −∞ if not. 
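Because the ANY-query Q-value is the log-probability that the event occurs at least once from time t onward, the backup can be sanity-checked against elementary probability. The sketch below (our own toy example, not from the paper) computes exp(Q) for a fixed policy on a tabular MDP, applying the backup directly in probability space:

```python
import numpy as np

def any_message(p_e, P, pi, T):
    """Probability-space backward messages for the ANY query.

    p_e: [S, A] event probabilities p(e = 1 | s, a).
    P: [S, A, S] transition probabilities; pi: [S, A] fixed policy.
    Returns Qs with Qs[t][s, a] = p(event occurs at least once in t..T-1 | s, a);
    the paper's Q-value is the log of this quantity.
    """
    S, A = p_e.shape
    Qs = [None] * T
    V = np.zeros(S)  # no chance of the event after the horizon
    for t in reversed(range(T)):
        # ANY backup: p(e=1|s,a) + p(e=0|s,a) * E_{s'}[V(s')]
        Q = p_e + (1.0 - p_e) * (P @ V)
        V = (pi * Q).sum(axis=1)  # marginalize the action under pi
        Qs[t] = Q
    return Qs
```

In a two-state chain where the event only fires at a goal state reached with probability 0.5, the message at the start state is exactly 0.5, matching the direct calculation.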
In the first-exit case, we have Q(s, a) = 1 if the goal is reachable and Q(s, a) = 0 if not; both cases result in the same policy.

4.3 Sample-based optimization using policy gradients

In small, discrete settings with known dynamics, we can use the backup equations in the previous section to solve for optimal policies with dynamic programming. For large problems with unknown dynamics, we can also derive model-free analogues to these methods, and apply them to complex tasks with high-dimensional function approximators. One commonly used method is the policy gradient, which we can derive via logarithmic differentiation as:

∇θ J(θ) = −∇θ D_KL(πθ(τ) || p(τ|evidence)) = E_{s1:T, a1:T ∼ πθ}[ Σ_{t=1}^{T} ∇θ log πθ(at|st) ( Q̂(s1:T, a1:T) + H^π(·|st:T) ) ].

Under certain assumptions we can replace Q̂(s1:T, a1:T) with Q̂(st:T, at:T) to obtain an estimator which only depends on future returns. See Appendix E for further explanation.

This estimator can be integrated into standard policy gradient algorithms, such as TRPO (Schulman et al., 2015), to train expressive inference models using neural networks. Extensions of our approach to other RL methods with function approximation, such as Q-learning and approximate dynamic programming, can also be derived from the backup equations, though this is outside the scope of the present work.

Algorithm 1 VICE: Variational Inverse Control with Events
1: Obtain examples of expert states and actions sE_i, aE_i.
2: Initialize policy π and binary discriminator Dθ.
3: for step n in {1, . . . , N} do
4:    Collect states and actions si = (s1, ..., sT), ai = (a1, ..., aT) by executing π.
5:    Train Dθ via logistic regression to classify expert data sE_i, aE_i from samples si, ai.
6:    Update π with respect to pθ using the appropriate inference objective.
7: end for

5 Learning event probabilities from data

In the previous section, we presented a control framework that operates on events rather than reward functions, and discussed how the user can choose from among a variety of inference queries to obtain a desired outcome. However, the event probabilities must still be obtained in some way, and may be difficult to hand-engineer in many practical situations: for example, an image-based deep RL system may need an image classifier to determine if it has accomplished its goal. In such situations, we can ask the user to instead supply examples of states or observations where the event has happened, and learn the event probabilities pθ(e = 1|s, a). Inverse reinforcement learning corresponds to the case when we assume the expert triggers an event at all time steps (the ALL query), in which case we require full demonstrations. However, if we assume the expert is optimal under an ANY or AT query, full demonstrations are not required, because the event is not assumed to be triggered at each time step. This means our supervision can take the form of a desired set of states rather than full trajectories. For example, in the vision-based robotics case, this means that we can specify goals using images of a desired goal state, which are much easier to obtain than full demonstrations.

Formally, for each query, we assume our dataset consists of states and actions (s, a) ∼ p_data(s, a|e = 1) where the event has happened, assuming the data-generating policy follows one of our inference queries. Our objective is imitation: we wish to train a model which produces samples that match the data. 
To that end, we learn the parameters of the model pθ(s, a|e = 1), trained with the maximum likelihood objective:

L(θ) = −E_{p_data}[ log pθ(s, a|e = 1) ].

The gradient of this objective is:

∇θ L(θ) = −E_{p_data}[ ∇θ log pθ(s, a|e = 1) ] + E_{pθ}[ ∇θ log pθ(s, a|e = 1) ],    (1)

where the second term corresponds to the gradient of the partition function of pθ(s, a|e = 1). This implies an algorithm where we sample states and actions from the model pθ and use them to compute the gradient update.

5.1 Sample-based optimization with discriminators

In high-dimensional settings, a convenient way to perform the gradient update in Eqn. 1 is to embed the model pθ(s, a|evidence) within a discriminator between samples from pθ and data from p_data, and take the gradient of the cross-entropy loss. To draw samples from the model, we instead train a "generator" policy via variational inference to draw samples from pθ. The variational inference procedure corresponds to those outlined in Section 4.

Specifically, we adapt the method of Fu et al. (2018), which alternates between training a discriminator with the fixed form

Dθ(s, a) = pθ(s, a) / (pθ(s, a) + π(a|s))

to distinguish between policy samples and success states, and a policy that minimizes the KL divergence D_KL(π(s, a) || pθ(s, a|e = 1)). As shown in previous work (Finn et al., 2016b; Fu et al., 2018), the gradient of the cross-entropy loss of the discriminator is equivalent to the gradient of Eqn. 1, and using the reward log Dθ(s, a) − log(1 − Dθ(s, a)) with the appropriate inference objective is equivalent to minimizing the KL divergence between the sampler and generator. 
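As a quick numerical check of the fixed-form discriminator (a sketch with our own illustrative function names, not the paper's code), the induced reward log D − log(1 − D) reduces algebraically to log pθ(s, a) − log π(a|s):

```python
import numpy as np

def discriminator_reward(log_p_theta, log_pi):
    """Reward induced by the fixed-form discriminator D = p_theta / (p_theta + pi).

    log D - log(1 - D) simplifies to log p_theta - log pi, so the policy
    is rewarded wherever the model assigns more mass than the policy does.
    """
    p_theta, pi = np.exp(log_p_theta), np.exp(log_pi)
    D = p_theta / (p_theta + pi)
    return np.log(D) - np.log(1.0 - D)
```

When the policy density matches the model density, the reward is zero everywhere, which is the fixed point of the adversarial game.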
We show the latter equivalence in Appendix F, and pseudocode for our algorithm is presented in Algorithm 1.

6 Experimental evaluation

Our experimental evaluation aims to answer the following questions: (1) How does the behavior of an agent change depending on the choice of query? We study this question in the case where the event probabilities are already specified. (2) Does our event learning framework (VICE) outperform simple alternatives, such as offline classifier training, when learning event probabilities from data? We study this question in settings where it is difficult to manually specify a reward function, such as when the agent receives raw image observations. (3) Does learning event probabilities provide better shaped rewards than the ground truth event occurrence indicators? Additional videos and supplementary material are available at https://sites.google.com/view/inverse-event.

6.1 Inference with pre-specified event probabilities

We first demonstrate how the ANY and ALL queries in our framework result in different behaviors. We adapt TRPO (Schulman et al., 2015), a natural policy gradient algorithm, to train policies using our query procedures derived in Section 4. Our examples involve two goal-reaching domains, HalfCheetah and Lobber, shown in Figure 4. The goal of HalfCheetah is to navigate a 6-DoF agent to a goal position, and in Lobber, a robotic arm must throw a block to a goal position. To study the inference process in isolation, we manually design the event probabilities as e^(−‖x_agent − x_target‖_2) for the HalfCheetah and e^(−‖x_block − x_goal‖_2) for the Lobber.

Figure 4: HalfCheetah and Lobber tasks.

The experimental results are shown in Table 1. While the average distance to the goal for both queries was roughly the same, the ANY query results in a much closer minimum distance. 
This makes sense, since in the ALL query the agent is punished for every time step it is not near the goal. The ANY query can afford to receive lower cumulative returns and instead has max-seeking behavior which more accurately reaches the target. Here, the ANY query better expresses our intention of reaching a target.

Table 1: Results on HalfCheetah and Lobber tasks (5 trials). The ALL query generally results in superior returns, but the ANY query results in the agent reaching the target more accurately. Random refers to a random Gaussian policy.

Query               | Avg. Dist   | Min. Dist
HalfCheetah-ANY     | 1.35 (0.20) | 0.97 (0.46)
HalfCheetah-ALL     | 1.33 (0.16) | 2.01 (0.48)
HalfCheetah-Random  | 8.95 (5.37) | 5.41 (2.67)
Lobber-ANY          | 0.61 (0.12) | 0.25 (0.20)
Lobber-ALL          | 0.59 (0.11) | 0.36 (0.21)
Lobber-Random       | 0.93 (0.01) | 0.91 (0.01)

6.2 Learning event probabilities

We now compare our event probability learning framework, which we call variational inverse control with events (VICE), against an offline classifier training baseline. We also compare our method to learning from true binary event indicators, to see if our method can provide some reward shaping benefits to speed up the learning process. The data for learning event probabilities comes from success states. That is, we have access to a set of states {sE_i}, i = 1...n, which may have been provided by the user, for which we know the event took place. This setting generalizes IRL, where instead of entire expert demonstrations, we simply have examples of successful states. The offline classifier baseline trains a neural network to distinguish success states ("positives") from states collected by a random policy. The number of positives and negatives in this procedure is kept balanced. This baseline is a reasonable and straightforward method to specify rewards in the standard RL framework, and provides a natural point of comparison to our approach, which can also be viewed as learning a classifier, but within the principled framework of control as inference. We evaluate these methods on the following tasks:

Table 2: Results on Maze, Ant and Pusher environments (5 trials). The metric reported is the final distance to the goal state (lower is better). VICE performs better than the classifier-based setup on all the tasks, and the performance is substantially better for the Ant and Pusher tasks. Detailed learning curves are provided in Appendix G.

Task    | Query | VICE (ours) | Classifier  | True Binary
Maze    | ALL   | 0.20 (0.19) | 0.35 (0.29) | 0.11 (0.01)
Maze    | ANY   | 0.23 (0.15) | 0.37 (0.21) |
Ant     | ALL   | 0.64 (0.32) | 2.71 (0.75) | 1.61 (1.35)
Ant     | ANY   | 0.62 (0.55) | 3.93 (1.56) |
Pusher  | ALL   | 0.09 (0.01) | 0.25 (0.01) | 0.17 (0.03)
Pusher  | ANY   | 0.11 (0.01) | 0.25 (0.01) |

Figure 5: Visualizations of the Pusher, Maze, and Ant tasks. In the Maze and Ant tasks, the agent seeks to reach a pre-specified goal position. In the Pusher task, the agent seeks to place a block at the goal position.

Maze from pixels. In this task, a point mass needs to navigate to a goal location through a small maze, depicted in Figure 5. The observations consist of 64x64 RGB images that correspond to an overhead view of the maze. The action space consists of X and Y forces on the robot. We use CNNs to represent the policy and the event distributions, training with 1000 success states as supervision.

Ant. In this task, a quadrupedal "ant" (shown in Figure 5) needs to crawl to a goal location, placed 3m away from its starting position. The state space contains joint angles and XYZ-coordinates of the ant. The action space corresponds to joint torques. We use 500 success states as supervision.

Pusher from pixels. 
In this task, a 7-DoF robotic arm (shown in Figure 5) must push a cylinder object to a goal location. The state space contains joint angles, joint velocities and 64x64 RGB images, and the action space corresponds to joint torques. We use 10K success states as supervision.

Training details and neural net architectures can be found in Appendix G. We also compare our method against a reinforcement learning baseline that has access to the true binary event indicator. For all the tasks, we define a "goal region", and give the agent a +1 reward when it is in the goal region, and 0 otherwise. Note that this RL baseline, which is similar to vanilla RL from sparse rewards, "observes" the event, providing it with additional information, while our model only uses the event probabilities learned from the success examples and receives no other supervision. It is included to provide a reference point on the difficulty of the tasks. Results are summarized in Table 2, and detailed learning curves can be seen in Figure 6 and Appendix G. We note the following salient points from these experiments.

VICE outperforms naïve classifier. We observe that for Maze, both the simple classifier and our method (VICE) perform well, though VICE achieves a lower final distance. In the Ant environment, VICE is crucial for obtaining good performance, and the simple classifier fails to solve the task. Similarly, for the Pusher task, VICE significantly outperforms the classifier (which fails to solve the task). Unlike the naïve classifier approach, VICE actively integrates negative examples from the current policy into the learning process, and appropriately models the event probabilities together with the dynamical properties of the task, analogously to IRL.

Shaping effect of VICE. For the more difficult Ant and Pusher domains, VICE actually outperforms RL with the true event indicators. 
We analyze this shaping effect further in Figure 6: our framework obtains performance that is superior to learning with true event indicators, while requiring much weaker supervision. This indicates that the event probability distribution learned by our method has a reward-shaping effect, which greatly simplifies the policy search process. We further compare our method against a hand-engineered shaped reward, depicted in dashed lines in Figure 6. The engineered reward is given by -0.2 * ||x_block - x_arm|| - ||x_block - x_goal||, and is impossible to compute when we don't have access to x_block, which is usually the case when learning in the real world. We observe that our method achieves performance that is comparable to this engineered reward, indicating that our automated shaping effect is comparable to hand-engineered shaped rewards.

Figure 6: Results on the Pusher task (lower is better), averaged across five random seeds. VICE significantly outperforms the naive classifier and true binary event indicators. Further, the performance is comparable to learning from an oracle hand-engineered reward (denoted in dashed lines). Curves for the Ant and Maze tasks can be seen in Appendix G.

7 Conclusion

In this paper, we described how the connection between control and inference can be extended to derive a reinforcement learning framework that dispenses with the conventional notion of rewards, and replaces them with events. Events have associated probabilities, which can either be provided by the user, or learned from data.
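A user-provided event probability can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the paper's setup: the goal position and the Gaussian-like form of the probability are hypothetical choices, used only to show that in the control-as-inference view the log-probability of the event plays the role of the reward.

```python
import numpy as np

GOAL = np.array([3.0, 0.0])  # hypothetical 2D goal position

def event_prob(state):
    """User-specified probability of the 'reached the goal' event: it decays
    smoothly with squared distance from GOAL (an illustrative choice)."""
    return np.exp(-np.sum((state - GOAL) ** 2))

def event_reward(state):
    """log p(event | state): maximizing this as a return maximizes the
    probability that the event occurs."""
    return np.log(event_prob(state))

near_goal = np.array([2.9, 0.1])
far_from_goal = np.array([0.0, 0.0])
# States nearer the goal receive a higher (less negative) reward.
```

When no such probability can be written down by hand, as in the image-based tasks above, it is instead learned from success examples.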
Recasting reinforcement learning into the event-based framework allows us to express various goals as different inference queries in the corresponding graphical model. The case where we learn event probabilities corresponds to a generalization of IRL where rather than assuming access to expert demonstrations, we assume access to states and actions where an event occurs. IRL corresponds to the case where we assume the event happens at every timestep, and we extend this notion to alternate graphical model queries where events may happen at a single timestep.

Acknowledgements

This research was supported by an ONR Young Investigator Program award, the National Science Foundation through IIS-1651843, IIS-1614653, and IIS-1700696, Berkeley DeepDrive, and donations from Google, Amazon, and NVIDIA.

References

Amodei, Dario, Olah, Chris, Steinhardt, Jacob, Christiano, Paul, Schulman, John, and Mané, Dan. Concrete problems in AI safety. arXiv preprint abs/1606.06565, 2016.

Argall, Brenna D., Chernova, Sonia, Veloso, Manuela, and Browning, Brett. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In ICRA, 2016a.

Finn, Chelsea, Christiano, Paul, Abbeel, Pieter, and Levine, Sergey. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint abs/1611.03852, 2016b.

Fu, Justin, Luo, Katie, and Levine, Sergey. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations (ICLR), 2018.

Haarnoja, Tuomas, Tang, Haoran, Abbeel, Pieter, and Levine, Sergey. Reinforcement learning with deep energy-based policies.
In International Conference on Machine Learning (ICML), 2017.

Hadfield-Menell, Dylan, Milli, Smitha, Abbeel, Pieter, Russell, Stuart J., and Dragan, Anca D. Inverse reward design. In NIPS, 2017.

Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Kalman, Rudolf. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82:35–45, 1960.

Kappen, Hilbert J., Gomez, Vicenc, and Opper, Manfred. Optimal control as a graphical model inference problem. 2009.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 2016.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. ISSN 0028-0836.

Nachum, Ofir, Norouzi, Mohammad, Xu, Kelvin, and Schuurmans, Dale. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

Ng, Andrew and Russell, Stuart. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.

O'Donoghue, Brendan, Munos, Remi, Kavukcuoglu, Koray, and Mnih, Volodymyr. Combining policy gradient and Q-learning. 2016.

Peng, Xue Bin, Andrychowicz, Marcin, Zaremba, Wojciech, and Abbeel, Pieter. Sim-to-real transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537, 2017.

Pinto, Lerrel and Gupta, Abhinav. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours.
In IEEE International Conference on Robotics and Automation (ICRA), 2016.

Rawlik, Konrad, Toussaint, Marc, and Vijayakumar, Sethu. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS), 2012.

Russell, Stuart J. and Norvig, Peter. Artificial Intelligence: A Modern Approach. Pearson Education, 2nd edition, 2003. ISBN 0137903952.

Rusu, Andrei A., Vecerik, Matej, Rothörl, Thomas, Heess, Nicolas, Pascanu, Razvan, and Hadsell, Raia. Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning (CoRL), 2017.

Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.

Schulman, John, Chen, Xi, and Abbeel, Pieter. Equivalence between policy gradients and soft Q-learning. 2017.

Singh, S., Lewis, R., and Barto, A. Where do rewards come from? In Proceedings of the International Symposium on AI Inspired Biology – A Symposium at the AISB 2010 Convention, 2010.

Sorg, Jonathan, Singh, Satinder P., and Lewis, Richard L. Reward design via online gradient ascent. In NIPS, 2010.

Todorov, Emo. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), 2007.

Todorov, Emo. General duality between optimal control and estimation. In IEEE Conference on Decision and Control (CDC), 2008.

Toussaint, Marc. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), 2009.

Tung, Hsiao-Yu Fish, Harley, Adam W., Huang, Liang-Kang, and Fragkiadaki, Katerina. Reward learning from narrated demonstrations. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Ziebart, Brian.
Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.

Ziebart, Brian, Maas, Andrew, Bagnell, Andrew, and Dey, Anind. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2008.