{"title": "Learning Safe Policies with Expert Guidance", "book": "Advances in Neural Information Processing Systems", "page_first": 9105, "page_last": 9114, "abstract": "We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the \"follow-the-perturbed-leader\" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.", "full_text": "Learning safe policies with expert guidance\n\nJessie Huang (1), Fa Wu (1, 2), Doina Precup (1), Yang Cai (1)\n\n{jiexi.huang,fa.wu2}@mcgill.ca, {dprecup,cai}@cs.mcgill.ca\n\n(1) School of Computer Science, McGill University\n(2) Zhejiang Demetics Medical Technology\n\nAbstract\n\nWe propose a framework for ensuring safe behavior of a reinforcement learning\nagent when the reward function may be difficult to specify. In order to do this,\nwe rely on the existence of demonstrations from expert policies, and we provide\na theoretical framework for the agent to optimize in the space of rewards consistent\nwith its existing knowledge. We propose two methods to solve the resulting\noptimization: an exact ellipsoid-based method and a method in the spirit of the\n\"follow-the-perturbed-leader\" algorithm. Our experiments demonstrate the behavior\nof our algorithm in both discrete and continuous problems. 
The trained agent\nsafely avoids states with potential negative effects while imitating the behavior of\nthe expert in the other states.\n\n1 Introduction\n\nIn Reinforcement Learning (RL), agent behavior is driven by an objective function defined through\nthe specification of rewards. Misspecified rewards may lead to negative side effects [3], when the\nagent responds unpredictably to aspects of the environment that the designer overlooked\nand potentially causes harm to the environment or itself. As the environment gets richer and more\ncomplex, it becomes more challenging to specify and balance rewards for every one of its aspects.\nYet if we want some type of safety guarantee on the behavior of an agent learned by\nRL once it is deployed in the real world, it is crucial to have a learning algorithm that is robust to\nmisspecifications.\nWe assume that the agent has some knowledge about the reward function either through past\nexperience or demonstrations from experts. The goal is to choose a robust/safe policy that achieves\nhigh reward with respect to any reward function that is consistent with the agent\u2019s knowledge (footnote 1). We\nformulate this as a maxmin learning problem where the agent chooses a policy and an adversary\nchooses a reward function that is consistent with the agent\u2019s current knowledge and minimizes the\nagent\u2019s reward. The goal of the agent is to learn a policy that maximizes the worst possible reward.\nWe assume that the reward functions are linear in some feature space. Our formulation has two\nappealing properties: (1) it allows us to combine demonstrations from multiple experts even though\nthey may disagree with each other; and (2) the training environment/MDP in which the experts\noperate need not be the same as the testing environment/MDP where the agent will be deployed; our\nresults hold as long as the testing and training MDPs share the same feature space. 
As an application,\nour algorithm can learn a maxmin robust policy in a new environment that contains a few features\nthat are not present in the training environment. See our gridworld experiment in Section 5.\n\nFootnote 1: Note that the safety as used here is more in the context of AI safety, and a policy is safe because it is robust\nto misspecified rewards and the consequent negative side effects.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nOur first result (Theorem 1) shows that given any algorithm that can find the optimal policy for\nan MDP in polynomial time, we can solve the maxmin learning problem exactly in polynomial\ntime. Our algorithm is based on a seminal result from combinatorial optimization \u2013 the equivalence\nbetween separation and optimization [9, 14] \u2013 and the ellipsoid method. To understand the difficulty\nof our problem, it is useful to think of maxmin learning as a two-player zero-sum game between the\nagent and the adversary. The deterministic policies correspond to the pure strategies of the agent.\nThe consistent reward functions we define in Section 3 form a convex set and the adversary\u2019s pure\nstrategies are the extreme points of this convex set. Unfortunately, both the agent and the adversary\nmay have exponentially many pure strategies, which are hard to describe explicitly. This makes\nsolving the two-player zero-sum game challenging. Using tools from combinatorial optimization,\nwe manage to construct separation oracles for both the agent\u2019s and the adversary\u2019s set of policies\nusing the MDP solver as a subroutine. With the separation oracles, we can solve the maxmin learning\nproblem in polynomial time using the ellipsoid method.\nTheorem 1 provides a polynomial time algorithm, but as it heavily relies on the ellipsoid method,\nit is computationally expensive to run in practice. 
We propose another algorithm (Algorithm 3)\nbased on the online learning algorithm \u2013 follow-the-perturbed-leader (FPL), and show that after T\niterations the algorithm computes a policy that is at most O(1/\u221aT) away from the true maxmin policy\n(Theorem 2). Moreover, each iteration of our algorithm is polynomial time. Notice that many other\nlow-regret learning algorithms, such as the multiplicative weights update method (MWU), are not\nsuitable for our problem. MWU requires explicitly maintaining a weight for every pure strategy\nand updating it in every iteration, resulting in an exponential time algorithm for our problem.\nFurthermore, we show that Algorithm 3 still has similar performance when we only have a fully\npolynomial time approximation scheme (FPTAS) for solving the MDP. The formal statement and\nproof are postponed to the supplemental material due to space limits.\n\n1.1 Related Work\n\nIn the sense of using expert demonstrations, our work is related to inverse reinforcement learning\n(IRL) and apprenticeship learning [18, 1, 21, 20]. In particular, the apprenticeship learning problem\naims to learn a policy that performs at least as well as the expert\u2019s policy on all basis rewards, and\ncan also be formulated as a maxmin problem [21, 20]. Despite the apparent similarity, our maxmin\nlearning problem is a completely different problem from apprenticeship learning. Here is\na simple example: consider a gridworld with two basis rewards, w1 and w2, and\nonly two routes/policies \u2013 top and bottom. The expert takes the top route, getting 100 under w1 and\n70 under w2. Alternatively, taking the bottom route gets 90 under both w1 and w2. Apprenticeship\nlearning will return the top route, because taking the alternative route performs worse than the expert\nunder w1 and violates the requirement. What is our solution? 
Assuming \u03b5 = 25 (footnote 2), both w1 and w2\n(or any convex combination of them) are consistent with the expert demonstration. If we choose the\ntop route, the worst case performance (under w2 in this case) is 70, while the correct maxmin solution\nto our problem is to take the bottom route so that its worst performance is 90. In the worst case\n(under w2), our maxmin policy has better guarantees and thus is more robust. Unlike apprenticeship\nlearning/IRL, we do not want to mimic the experts or infer their rewards, but we want to produce\na policy with robustness guarantees by leveraging their data. As a consequence, our results are\napplicable to settings where the training and testing environments are different (as discussed in the\nIntroduction). Moreover, our formulation allows us to combine multiple expert demonstrations.\nInverse reward design [10] uses a proxy reward and infers the true reward by estimating its posterior.\nThen it uses risk-averse planning together with samples from the posterior in the testing environment\nto achieve safe exploration. Our approach achieves a similar goal without assuming any distribution\nover the rewards and is arguably more robust. We apply a single reward function to the whole MDP\nwhile they apply (maybe too pessimistically) per step/trajectory maxmin planning. Furthermore,\nour algorithm is guaranteed to find the maxmin solution in polynomial time, and can naturally\naccommodate multiple experts.\nIn repeated IRL [2], the agent acts on behalf of a human expert in a variety of tasks, and the\nhuman expert corrects the agent when the agent\u2019s policy is far from the optimum. The goal is to\nminimize the number of corrections from the expert, and they provide an upper bound on the number\nof corrections by reducing the problem to the ellipsoid method. 
Their model requires continuous\ninteraction with an expert while our model only assumes the availability of one or a few expert\npolicies prior to training. Furthermore, we aim to find a maxmin optimal policy, while their paper\nfocuses on minimizing the number of corrections needed.\n\nFootnote 2: See Section 3 for the formal definition of consistent rewards. Intuitively, it means that the expert\u2019s policy\nyields a reward that is within \u03b5 of the optimal possible reward.\n\nRobust Markov Decision Processes [19, 12] have addressed the problem of performing dynamic\nprogramming-style optimization in environments in which the transition probability matrix is uncertain.\nLim, Xu & Mannor [16] have extended this idea to reinforcement learning methods. This body of\nwork also uses min-max optimization, but because the optimization is with respect to worst-case\ntransitions, this line of work results in very pessimistic policies. Our algorithmic approach and\nflavor of results are also different. [17] have addressed a similar adversarial setup, but in which the\nenvironment designs a worst-case disturbance to the dynamics of the agent, and have addressed this\nsetup using H\u221e control.\nPaper Organization: We introduce the notation and define the maxmin learning problem in\nSection 2. We provide three different ways to define the set of consistent reward functions in\nSection 3, and present the ellipsoid-based exact algorithm and its analysis in Section 4.1. The FPL-based\nalgorithm and its analysis are in Section 4.2, followed by experimental results in Section 5.\n\n2 Preliminary\n\nAn MDP is a tuple M = (S, A, P_sa, \u03b3, D, R), including a finite set of states, S, a set of actions, A,\nand transition probabilities, P_sa. \u03b3 is a discount factor, and D is the distribution of initial states. The\nreward function R instructs the learning process. We assume that the reward is a linear function of\nsome vector of features \u03c6 : S \u2192 [0, 1]^k over states. 
That is, R(s) = w \u00b7 \u03c6(s) for every state s \u2208 S,\nwhere w \u2208 R^k is the vector of reward weights of the MDP. The true reward weights w* are assumed to be\nunknown to the agent. We use \u27e8\u00b7\u27e9 to denote the bit complexity of an object. In particular, we use \u27e8M\u27e9\nto denote the bit complexity of M, which is the number of bits required to represent the distribution\nof initial states, transition probabilities, the discount factor \u03b3, and the rewards at all the states. We use\nthe notation M\\R to denote an MDP without the reward function, and \u27e8M\\R\u27e9 is its bit complexity.\nWe further assume that \u03c6(s) can be represented using at most \u27e8\u03c6\u27e9 bits for any state s \u2208 S.\nAn agent selects the action according to a policy \u03c0. The value of a policy under rewards\nw is E_{s0\u223cD}[V^\u03c0(s0) | M] = w \u00b7 E[\u2211_{t=0}^{\u221e} \u03b3^t \u03c6(s_t) | M, \u03c0].\nIt is expressed as the weights multiplied by the accumulated discounted feature value\ngiven a policy, which we define as \u03a8(\u03c0) = E[\u2211_{t=0}^{\u221e} \u03b3^t \u03c6(s_t) | M, \u03c0].\nMDP solver We assume that there is an RL algorithm ALG that takes an MDP as input and outputs\nan optimal policy and its corresponding representation in the feature space. In particular, ALG(M)\noutputs (\u03c0*, \u00b5*) such that E_{s0\u223cD}[V^{\u03c0*}(s0) | M] = max_\u03c0 E_{s0\u223cD}[V^\u03c0(s0) | M] and \u00b5* = \u03a8(\u03c0*).\nMaxmin Learning All weights that are consistent with the agent\u2019s knowledge form a set PR. We\nwill discuss several formal ways to define this set in Section 3. The goal of the agent is to learn a policy\nthat maximizes the reward under an adversarially chosen reward function induced by weights in PR.\nMore specifically, the max-min learning problem is max_{\u00b5\u2208PF} min_{w\u2208PR} w^T \u00b5,\nwhere PF is the polytope that contains the representations of all policies in the feature space, i.e.\nPF = {\u00b5 | \u00b5 = \u03a8(\u03c0) for some policy \u03c0}. 
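As a small illustration of the max-min objective above, the sketch below evaluates the two-route gridworld example from Section 1.1. It assumes that the policy set and the set of consistent weights are both tiny and listed explicitly; the table `payoff` and the function names are hypothetical and introduced only for this illustration. Each table entry stands for the value w · µ of one route under one basis reward. Because the objective is linear in w, checking the two extreme points w1 and w2 is enough for the inner minimum over their convex combinations.

```python
# Toy instance taken from the two-route example in Section 1.1.
# Rows are candidate policies (routes), columns are consistent reward weights.
# Each number stands for the expected reward w . mu of that route under that weight.
payoff = {
    'top': {'w1': 100.0, 'w2': 70.0},
    'bottom': {'w1': 90.0, 'w2': 90.0},
}


def worst_case(policy):
    # The adversary picks the consistent weight that minimizes the agent's reward.
    return min(payoff[policy].values())


def maxmin_policy():
    # The agent picks the policy whose worst-case reward is largest.
    return max(payoff, key=worst_case)


best = maxmin_policy()
print(best, worst_case(best))
```

On this instance the bottom route is selected with a worst-case reward of 90, which matches the discussion in Section 1.1. The general problem replaces both finite tables with the polytopes PF and PR, which is why the paper needs separation oracles rather than enumeration.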
WLOG, we assume that all weights lie in [-1, 1]^k.\nSeparation Oracles To perform maxmin learning, we often need to optimize linear functions over\nconvex sets that are intersections of exponentially many halfspaces. Such an optimization problem is\nusually intractable, but if the convex set permits a polynomial time separation oracle, then there\nexist polynomial time algorithms (e.g. the ellipsoid method) that optimize linear functions over it.\nDefinition 1. (Separation Oracle) Let P be a closed, convex subset of Euclidean space R^d. Then a\nSeparation Oracle for P is an algorithm that takes as input a point x \u2208 R^d and outputs \u201cYES\u201d if\nx \u2208 P, or a hyperplane (w, c) such that w \u00b7 y \u2264 c for all y \u2208 P, but w \u00b7 x > c. Note that because P\nis closed and convex, such a hyperplane always exists whenever x \u2209 P.\n\n3 Consistent Reward Polytope\n\nIn this section, we discuss several ways to define the consistent reward polytope PR.\n\nExplicit Description We assume that the agent knows that the weights satisfy a set of explicitly\ndefined linear inequalities of the form c \u00b7 w \u2265 b. For example, such an inequality can be learned by\nobserving that a particular policy yields a reward that is larger or smaller than a certain threshold (footnote 3).\nImplicitly Specified by an Expert Policy Usually, it may not be easy to obtain many explicit\ninequalities about the weights. Instead, we may have observed a policy \u03c0_E used by an expert. We\nfurther assume that the expert\u2019s policy has a reasonably good performance under the true rewards\nw*. Namely, \u03c0_E\u2019s expected reward is at most \u03b5 less than the optimal one. Let the expert\u2019s feature vector\nbe \u00b5_E = \u03a8(\u03c0_E). The set PR therefore contains all w such that \u00b5_E \u00b7 w \u2265 \u00b5^T \u00b7 w - \u03b5, \u2200\u00b5 \u2208 PF. It is\nnot hard to verify that under this definition PR is a convex set. 
Even though explicitly specifying\nPR is extremely expensive as there are infinitely many \u00b5 \u2208 PF, we can construct a polynomial time\nseparation oracle SOR (Algorithm 1). An alternative way to define PR is to assume that the expert\npolicy can achieve (1 - \u03b5) of the optimal reward (assuming the final reward is positive). We can again\ndesign a polynomial time separation oracle similar to Algorithm 1.\n\nAlgorithm 1 Separation Oracle SOR for the reward polytope PR\ninput w' \u2208 R^k\n1: Let \u00b5_{w'} := argmax_{\u00b5\u2208PF} \u00b5 \u00b7 w'. Notice that \u00b5_{w'} is the feature vector of the optimal policy under\n   reward weights w'. Hence, it can be computed by our MDP solver ALG.\n2: if \u00b5_{w'} \u00b7 w' > \u00b5_E \u00b7 w' + \u03b5 then\n3:   output \u201cNO\u201d, and (\u00b5_E - \u00b5_{w'}) \u00b7 w + \u03b5 \u2265 0 as the separating hyperplane, since for all\n     w \u2208 PR, \u00b5_E \u00b7 w \u2265 \u00b5_{w'} \u00b7 w - \u03b5.\n4: else\n5:   output \u201cYES\u201d.\n6: end if\n\nCombining Multiple Experts How can we combine demonstrations from experts operating in\ndrastically different environments? Here is our model. For each environment i, there is a separate\nMDP M_i, and all the MDPs share the same underlying weights as they are all about completing the\nsame task although in different environments. The i-th expert\u2019s policy is nearly optimal in M_i. More\nspecifically, for expert i, her policy \u03c0_{E_i} is at most \u03b5_i less than the optimal policy in M_i. Therefore,\neach expert i provides a set of constraints that any consistent reward needs to satisfy, and PR is the\nset of rewards that satisfy all constraints imposed by the experts. For each expert i, we can design a\nseparation oracle SO_R^(i) (similar to Algorithm 1) accepting weights that respect the constraints given\nby expert i\u2019s policy. 
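The separation oracle of Algorithm 1 and the per-expert oracles just described can be sketched as follows. This is only an illustration under a strong simplifying assumption: the MDP solver ALG is replaced by a maximum over a short, explicitly listed set of feature vectors (`candidate_mus`), which is not how the paper computes it. All function and variable names are hypothetical. A rejected weight is returned together with the pair (normal, eps) describing the halfspace normal · w + eps >= 0 that every consistent weight satisfies.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def separation_oracle(w, mu_expert, candidate_mus, eps):
    # Stand-in for the MDP solver ALG (an assumption for illustration):
    # pick the best feature vector under w from an explicit list.
    mu_best = max(candidate_mus, key=lambda mu: dot(mu, w))
    if dot(mu_best, w) > dot(mu_expert, w) + eps:
        # w is inconsistent: under w the expert is more than eps below optimal.
        # Every consistent w satisfies (mu_expert - mu_best) . w + eps >= 0.
        normal = [e - b for e, b in zip(mu_expert, mu_best)]
        return False, (normal, eps)
    return True, None


def combined_oracle(w, experts):
    # experts is a list of (mu_expert, candidate_mus, eps) triples, one per
    # environment. A weight is accepted only if every expert's oracle accepts it.
    for mu_expert, candidate_mus, eps in experts:
        ok, hyperplane = separation_oracle(w, mu_expert, candidate_mus, eps)
        if not ok:
            return False, hyperplane
    return True, None


candidates = [(1.0, 0.0), (0.0, 1.0)]
expert = (1.0, 0.0)
print(separation_oracle((1.0, 0.0), expert, candidates, 0.5))
print(separation_oracle((0.0, 1.0), expert, candidates, 0.5))
```

In this toy run the first weight vector is accepted, since the expert is optimal under it, and the second is rejected, since the expert falls short of the optimum by 1, which exceeds eps = 0.5.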
We can easily design a separation oracle for PR that only accepts weights that\nwill be accepted by all separation oracles SO_R^(i).\nFrom now on, we will not distinguish between different ways to define and access the consistent\nreward polytope PR, but simply assume that we have a polynomial time separation oracle for it. All\nthe algorithms we design in this paper only require access to this separation oracle. In Section 5, we\nwill specify how PR is defined for each experiment.\n\nFootnote 3: Note that with a polynomial number of trajectories, one can apply standard Chernoff bounds to derive\nsuch inequalities that hold with high probability. It is often the case that the probability is so close to 1 that the\ninequality can be treated as always true for all practical purposes.\n\n4 Maxmin Learning using an Exact MDP Solver\n\nIn this section, we show how to design maxmin learning algorithms. Our algorithm only interacts\nwith the MDP through the MDP solver, which can be either model-based or model-free. Our first\nalgorithm solves the maxmin learning problem exactly using the ellipsoid method. Despite the\nfact that the ellipsoid method has provable worst-case polynomial running time, it is known to be\nsometimes inefficient in practice. Our second algorithm is an efficient iterative method based on the\nonline learning algorithm \u2013 follow-the-perturbed-leader (FPL).\n\n4.1 Ellipsoid-Method-Based Solution\nTheorem 1. Given a polynomial time separation oracle SOR for the consistent reward polytope PR\nand an exact polynomial time MDP solver ALG, we have a polynomial time algorithm such that\nfor any MDP without the reward function M\\R, the algorithm computes the maxmin policy \u03c0* with\nrespect to M\\R and PR.\nThe plan is to first solve the maxmin learning problem in the feature space and then convert it back\nto the policy space. Solving the maxmin learning problem in the feature space is equivalent to\nsolving the linear program in Figure 1.\n\nmax z\nsubject to z \u2264 \u00b5 \u00b7 w, \u2200w \u2208 PR\n           \u00b5 \u2208 PF\n\nFigure 1: Maxmin Learning LP.\n\nThe challenges for solving the LP are that (i) it\nis not clear how to check whether \u00b5 lies in the polytope PF, and (ii) there are seemingly infinitely\nmany constraints of the type z \u2264 \u00b5 \u00b7 w as there are infinitely many w \u2208 PR. Next, we show that\ngiven an exact MDP solver ALG, we can design a polynomial time separation oracle for the set of\nfeasible variables (\u00b5, z) of LP 1. With this separation oracle, we can apply the ellipsoid method (see\nTheorem 3 in the supplementary material) to solve LP 1 in polynomial time.\nFirst, we design a separation oracle for polytope PF by invoking a seminal result from optimization \u2013\nthe equivalence between separation and optimization.\nLemma 1 (Separation \u2261 Optimization). [9, 14] Consider any convex polytope P = {x : Ax \u2264 b} \u2286 R^d\nand the following two problems:\nLinear Optimization: given a linear objective c \u2208 R^d, compute x* \u2208 argmax_{x\u2208P} c \u00b7 x\nSeparation: given a point y \u2208 R^d, decide that y \u2208 P, or else find h \u2208 R^d s.t. h \u00b7 x < h \u00b7 y, \u2200x \u2208 P.\nIf P can be described implicitly using \u27e8P\u27e9 bits, then the separation problem is solvable\nin poly(\u27e8P\u27e9, d, \u27e8y\u27e9) time for P if and only if the linear optimization problem is solvable in\npoly(\u27e8P\u27e9, d, \u27e8c\u27e9) time.\nIt is not hard to see that if one can solve the separation problem, one can construct a separation oracle\nin polynomial time and apply the ellipsoid method to solve the linear optimization problem. The\nless obvious direction in the result above states that if one can solve the linear optimization problem,\none can also use it to construct a separation oracle. The equivalence between these two problems\nturns out to have profound implications in combinatorial optimization and has enabled numerous\npolynomial time algorithms for many problems that are difficult to solve otherwise.\n\nAlgorithm 2 Separation Oracle for the feasible (\u00b5, z) in LP 1\ninput (\u00b5', z') \u2208 R^{k+1}\n1: Query SOF(\u00b5').\n2: if \u00b5' \u2209 PF then\n3:   output \u201cNO\u201d and output the same separating hyperplane as outputted by SOF(\u00b5').\n4: else\n5:   Let w* \u2208 argmin_{w\u2208PR} \u00b5' \u00b7 w and V = \u00b5' \u00b7 w*. This requires solving a linear optimization\n     problem over PR using the ellipsoid method with the separation oracle SOR.\n6:   if z' \u2264 V then\n7:     output \u201cYES\u201d\n8:   else\n9:     output \u201cNO\u201d, and a separating hyperplane z \u2264 \u00b5 \u00b7 w*, as z' > \u00b5' \u00b7 w* and all feasible\n       solutions of LP 1 respect this constraint.\n10:  end if\n11: end if\n\nOur goal is to design a polynomial time separation oracle for the polytope PF. The key observation is\nthat the linear optimization problem over polytope PF: max_{\u00b5\u2208PF} w \u00b7 \u00b5 is exactly the same as solving\nthe MDP with reward function R(\u00b7) = w \u00b7 \u03c6(\u00b7). 
Therefore, we can use the MDP solver to design a\npolynomial time separation oracle for PF.\nLemma 2. Given access to an MDP solver ALG that solves any MDP M in time polynomial in\n\u27e8M\u27e9, we can design a separation oracle SOF for PF that runs in time polynomial in \u27e8M\\R\u27e9, \u27e8\u03c6\u27e9, k,\nand the bit complexity of the input (footnote 4).\n\nFootnote 4: Note that SOF only depends on the bit complexity of M\\R, but not the actual model of M\\R such as the\ndistributions of the initial states or the transition probabilities. We only require access to ALG and an upper\nbound of \u27e8M\\R\u27e9.\n\nWith SOF, we first design a polynomial time separation oracle for checking the feasible (z, \u00b5) pairs\nin LP 1 (Algorithm 2). With the separation oracle, we can solve LP 1 using the ellipsoid method.\nThe last difficulty is that the optimal solution only gives us the maxmin feature vector instead of the\ncorresponding maxmin policy. We use the following nice property of SOF to convert the optimal\nsolution in the feature space to the policy space. See Section 8 in the supplementary material for\nintuition behind Lemma 3.\nLemma 3. [9, 14, 7] If SOF(\u00b5) = \u201cYES\u201d, there exists a set, C, of weights w \u2208 R^k such that SOF\nhas queried the MDP solver ALG on reward function w \u00b7 \u03c6(\u00b7) for every w \u2208 C. Let (\u03c0_w, \u00b5_w) be the\noutput of ALG on weight w, then \u00b5 lies in the convex hull of {\u00b5_w | w \u2208 C}.\nProof of Theorem 1: It is not hard to see that Algorithm 2 is a valid polynomial time separation oracle\nfor the feasible (\u00b5, z) pairs in LP 1. Hence, we can solve LP 1 in polynomial time with the ellipsoid\nmethod with access to Algorithm 2. Next, we show how to convert the optimal solution \u00b5* of LP 1\nto the corresponding maxmin optimal policy \u03c0*. Here, we invoke Lemma 3. We query SOF on \u00b5*\nand we record all weights w that SOF has queried the MDP solver ALG on. Let C = {w_1, . . . , w_\u2113}\nbe all the queried weights. 
As SOF is a polynomial time algorithm, \u2113 is also polynomial in the\ninput size. By Lemma 3, we know that \u00b5* is in the convex hull of {\u00b5_w | w \u2208 C}, which means there\nexists a set of nonnegative numbers p_1, . . . , p_\u2113 such that \u2211_{i=1}^{\u2113} p_i = 1 and \u00b5* = \u2211_{i=1}^{\u2113} p_i \u00b7 \u00b5_{w_i}.\nClearly, the discounted accumulated feature value of the randomized policy \u2211_{i=1}^{\u2113} p_i \u00b7 \u03c0_{w_i} equals\n\u2211_{i=1}^{\u2113} p_i \u00b7 \u03a8(\u03c0_{w_i}) = \u2211_{i=1}^{\u2113} p_i \u00b7 \u00b5_{w_i} = \u00b5*. We can compute the p_i\u2019s in poly-time via linear\nprogramming, and \u2211_{i=1}^{\u2113} p_i \u00b7 \u03c0_{w_i} is the maxmin policy. \u25a1\n\n4.2 Finding the Maxmin Policy using Follow the Perturbed Leader\nThe exact algorithm of Theorem 1 may be computationally expensive to run, as the separation oracle\nSOF requires running the ellipsoid method to answer every query, and on top of that we need to run\nthe ellipsoid method with queries to SOF. In this section, we propose a simpler and faster algorithm\nthat is based on the algorithm \u2013 follow-the-perturbed-leader (FPL) [13].\nTheorem 2. For any \u03be \u2208 (0, 1/2), with probability at least 1 - 2\u03be, Algorithm 3 finds a policy \u03c0\nafter T rounds of iterations such that its expected reward under any weight from PR is at least\nmax_{\u00b5\u2208PF} min_{w\u2208PR} \u00b5 \u00b7 w - k^2 (6 + 4\u221a(ln 1/\u03be)) / \u221aT. In every iteration, Algorithm 3 makes one query to\nALG and O(k^2 ((log k)^2 + ((b + \u27e8\u03c6\u27e9)(|A||S| + k) + log T)^2)) queries to SOR, where b is an\nupper bound on the number of bits needed to specify the transition probability P_sa for any state s\nand action a.\n\nFPL is a classical online learning algorithm that solves a problem where a series of decisions d_1, d_2, ...\nneed to be made. Each d_i is from a possibly infinite set D \u2286 R^n. The state s_t \u2208 S \u2286 R^n at step\nt is observed after the decision d_t. 
The goal is to have the total reward \u2211_t d_t \u00b7 s_t not far from the\nreward of the best fixed decision from D in hindsight, that is max_{d\u2208D} \u2211_t d \u00b7 s_t. The FPL algorithm\nguarantees that after T rounds, the regret \u2211_t d_t \u00b7 s_t - max_{d\u2208D} \u2211_t d \u00b7 s_t scales linearly in \u221aT. This\nguarantee holds for both oblivious and adaptive adversaries, and the bound holds both in expectation\nand with high probability (see Theorem 4 in Section 8 of the supplementary material for the formal\nstatement).\nFPL falls into a large class of algorithms that are called low-regret algorithms, as the regret grows\nsub-linearly in T. It is well known that low-regret algorithms can be used to solve two-player\nzero-sum games approximately. The maxmin problem we face here can also be modeled as a two-player\nzero-sum game. One player is the agent whose strategy is a policy \u03c0, and the other player is the\nreward designer whose strategy is a weight w \u2208 PR. The agent\u2019s payoff is the reward that it collects\nusing policy \u03c0, which is \u03a8(\u03c0) \u00b7 w, and the designer\u2019s payoff is -\u03a8(\u03c0) \u00b7 w. Finding the maxmin\nstrategy for the agent is equivalent to finding the maxmin policy. One challenge here is that the\nnumbers of strategies for both players are infinite. Even if we only consider the pure strategies\nwhich correspond to the extreme points of PF and PR, there are still exponentially many of them.\nMany low-regret algorithms such as multiplicative-weights-update require explicitly maintaining a\ndistribution over the pure strategies, and updating it in every iteration. In our case, these algorithms\nwill take exponential time to finish just a single iteration. This is the reason why we favor the FPL\nalgorithm, as the FPL algorithm only requires finding the best policy given the past weights, which\ncan be done by the MDP solver ALG. 
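To make the game-playing view above concrete, here is a minimal sketch of FPL-style self-play on a toy zero-sum game. It assumes that both players have a short explicit list of pure strategies (`mus` for the agent, `ws` for the reward designer), so the calls to ALG and to the optimization over PR are replaced by plain `max` and `min` over those lists. The function names, the toy game and the seed are hypothetical, and the perturbation scale simply mirrors the 1/(k sqrt(T)) choice used in the paper's algorithm. The sketch does not reproduce the paper's guarantees.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fpl_maxmin(mus, ws, T, seed=0):
    # Returns the empirical mixture over the agent's pure strategies after T rounds.
    rng = random.Random(seed)
    k = len(mus[0])
    delta = 1.0 / (k * math.sqrt(T))
    sum_w = [0.0] * k    # running sum of the designer's past weights
    sum_mu = [0.0] * k   # running sum of the agent's past feature vectors
    counts = [0] * len(mus)
    for _ in range(T):
        p = [rng.uniform(0.0, 1.0 / delta) for _ in range(k)]
        q = [rng.uniform(0.0, 1.0 / delta) for _ in range(k)]
        # Agent: best response to the perturbed sum of past weights.
        perturbed_w = [a + b for a, b in zip(sum_w, p)]
        best = max(range(len(mus)), key=lambda j: dot(mus[j], perturbed_w))
        # Designer: weight minimizing reward against the perturbed sum of past features.
        perturbed_mu = [a + b for a, b in zip(sum_mu, q)]
        w_t = min(ws, key=lambda w: dot(w, perturbed_mu))
        counts[best] += 1
        sum_w = [a + b for a, b in zip(sum_w, w_t)]
        sum_mu = [a + b for a, b in zip(sum_mu, mus[best])]
    return [c / T for c in counts]


def worst_case_value(mix, mus, ws):
    # Reward of the randomized policy under the least favorable listed weight.
    return min(sum(m * dot(mu, w) for m, mu in zip(mix, mus)) for w in ws)


mus = [(1.0, 0.0), (0.0, 1.0)]
ws = [(1.0, -1.0), (-1.0, 1.0)]
mix = fpl_maxmin(mus, ws, T=500, seed=0)
print(mix, worst_case_value(mix, mus, ws))
```

In this toy game the maxmin value is 0, attained by mixing the two strategies evenly, so the printed worst-case value can never exceed 0. How close a particular run gets depends on T and on the random perturbations.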
We also show that a similar result holds even if we replace the\nexact MDP solver with an additive FPTAS \\widehat{ALG}. The proof of Theorem 2 can be found in Section 8\nin the supplementary material. Our generalization to cases where we only have access to \\widehat{ALG} is\npostponed to Section 9 in the supplementary material.\n\nAlgorithm 3 FPL Maxmin Learning\ninput T: the number of iterations\n1: Set \u03b4 := 1/(k\u221aT).\n2: Arbitrarily pick some policy \u03c0_1, compute \u00b5_1 \u2208 PF. Arbitrarily pick some reward weights w_1,\n   and set t = 1.\n3: while t \u2264 T do\n4:   Use ALG to compute the optimal policy \u03c0_t and \u00b5_t = \u03a8(\u03c0_t) that maximizes the expected\n     reward under reward function (\u2211_{i=1}^{t-1} w_i + p_t) \u00b7 \u03c6(\u00b7), where p_t is drawn uniformly from\n     [0, 1/\u03b4]^k.\n5:   Let w_t := argmin_{w\u2208PR} w^T (\u2211_{i=1}^{t-1} \u00b5_i + q_t), where q_t is drawn uniformly from [0, 1/\u03b4]^k.\n6:   t := t + 1.\n7: end while\n8: Output the randomized policy (1/T) \u00b7 \u2211_{t=1}^{T} \u03c0_t.\n\n5 Experiments\n\nGridworld We use gridworlds in the first set of experiments. Each grid may have a different\n\"terrain\" type such that passing the grid incurs a certain reward. For each grid, a feature vector \u03c6(s)\ndenotes the terrain type, and the true reward can be expressed as R* = w* \u00b7 \u03c6(s). The agent\u2019s goal\nis to move to a goal grid with maximal reward under the worst possible weights that are consistent\nwith the expert. In other words, the maxmin policy is a safe policy, as it avoids possible negative side\neffects [3]. In the experiments, we construct the expert policy \u03c0_E in a small (10\u00d710) demonstration\ngridworld that contains a subset of the terrain types. One expert policy is provided, and the number\nof trajectories that we need to estimate the expert policy\u2019s cumulative feature follows the sample\ncomplexity analysis as in [21]. 
In the following experiment we set \u03b5 = 0.5, which defines PR and\ncaptures how close to optimal the expert is.\nAn example behavior is shown in Figure 3. There are 5 possible terrain types. The expert\npolicy in Figure 3 (left) has only seen 4 terrain types. We compute the maxmin policy in the\n\"real-world\" MDP of a much larger size (50\u00d750) with all 5 terrain types using Algorithm 3 with\nthe reward polytope PR implicitly specified by the expert policy. Figure 3 (middle) shows that\nour maxmin policy avoids the red-colored terrain that was missing from the demonstration\nMDP. To facilitate observation, Figure 3 (right) shows the same behavior by an agent trained in\na smaller MDP. Figure 2 compares the maxmin policy to a baseline. The baseline policy is computed\nin an MDP whose reward weights are the same as the demonstration MDP for the first\nfour terrain types and the fifth terrain weight is chosen at random. Our maxmin policy is much\nsafer than the baseline as it completely avoids the fifth terrain type. It also imitates the expert\u2019s\nbehavior by favoring the same terrain types.\n\nFigure 2: Experiment results comparing our maxmin policy to a baseline. The baseline was\ncomputed with a random reward for the fifth terrain and the other four terrain rewards set the same\nas the demonstration MDP. Our maxmin policy is much safer than the baseline as it completely\navoids traversing the fifth (unknown) terrain type. It should also be noticed that the maxmin policy\nlearns from the expert policy while achieving the goal of avoiding potential negative side effects,\nas the fraction of the trajectory on each terrain type closely resembles that of the expert.\n\nWe also implemented the maxmin method in gridworlds with a stochastic transition model.\nThe maxmin policy (see Figure 8 in Section 10 of the supplementary material) is more conservative\ncompared to the deterministic model, and\nchooses paths that are further away from any unknown terrains. More details and computation time\ncan be found in the supplementary material.\nIt should be noted that the training and testing MDPs are different. More specifically, the red\nterrain type is missing from the expert demonstration, and the testing MDP is of a larger size. As\ndiscussed in the Introduction, our formulation allows the testing MDP in which the agent operates\nto be different from the training MDP in which the expert demonstrates, as long as the two MDPs\nshare the same feature space. All of our experiments have this property. To the best of our\nknowledge, apprenticeship learning requires the training and testing MDPs to be the same, thus a\ndirect comparison is not possible. For example, in the gridworld experiments, one has to explicitly\nassign a reward to the \"unknown\" feature in order to apply apprenticeship learning, which may cause\nthe problem of reward misspecification and negative side effects. Our maxmin solution is robust to\nsuch issues.\n\nFigure 3: An example of maxmin policy in gridworlds. Left: an expert policy in the small demonstration\nMDP, where 4 of 5 terrain types were used and their weights were randomly chosen. The\nexpert policy guides moving towards the yellow goal grid while preferring the terrains with higher\nrewards (light blue and light green). 
Middle: when faced with terrain types (red-colored) that the\nexpert policy never experienced, the maxmin policy avoids such terrains and the accompanying negative\nside effects. The agent learns to operate in a larger (50\u00d750) gridworld. Right: an agent in a smaller\nMDP to facilitate observation. The maxmin policy generates two possible trajectories.\n\nFigure 4: Modified cartpole task with two additional features \u2013 question blocks on either side of\nthe center. The rewards associated with passing these blocks are not provided to the agent.\n\nCartPole Our next experiments are based on the classic control task of cartpole and the environment\nprovided by OpenAI Gym [6]. While we can only solve the problem approximately using model-free\nlearning methods, our experiments show that our FPL-based algorithm can learn a safe policy\nefficiently for a continuous task. Moreover, if provided with more expert policies, our maxmin\nlearning method can easily accommodate and learn from multiple experts.\nWe modify the cartpole problem by adding two features to the environment, shown as the two question\nblocks in Figure 4; more details are given in the supplementary material. The agent does not know\nwhat consequences passing these two blocks may have. Instead of\nknowing the rewards associated with these two blocks, we have expert policies from two other related\nscenarios. The first expert policy (Expert A) performs well in scenario A, where only the blue block to\nthe left of the center is present, and the second expert policy (Expert B) performs well in scenario B,\nwhere only the yellow block to the right of the center is present. The behaviors of the expert policies\nin a default scenario (without any question blocks) and in scenarios A and B are shown in Figure 5.\nCompared with the default scenario, the expert policies in the other two scenarios clearly prefer\nto travel to the right side. 
Intuitively, it seems that the blue block incurs negative effects while the\nyellow block is either neutral or positive.\n\nFigure 5: Behavior examples of different policies. Occupancy is defined as the number of steps in\nwhich the agent appears at a location divided by the total number of steps. Top: In the default\nsetting without any question blocks, the travel range is relatively symmetric around the center of\nthe field. Middle: In the presence of the blue question block to the left, expert policy A guides\nmovement to the right. Bottom: In scenario B, where only the yellow question block is present,\nexpert policy B also guides movement to the right.\n\nFigure 6: Maxmin policy learnt with different expert policies. Top: Given Expert A policy only, the\nagent learns to stay within a narrow range slightly to the right of the center to avoid both\nquestion blocks. Because the agent has no knowledge about the yellow block, the maxmin policy avoids\nit. Bottom: When given both Expert A and Expert B policies, the agent learns that it is safe to\npass the yellow block, so the range is wider and extends more to the right compared with the maxmin\npolicy learnt from Expert A alone.\n\nNow we train the agent in the presence of both question blocks. First, we provide the agent with\nthe Expert A policy alone and learn a maxmin policy. The maxmin policy\u2019s behavior is shown in\nFigure 6 (top). It tries to avoid both question blocks, since it observes that Expert A avoids the\nblue block and it has no knowledge of the yellow block. Then, we provide both Expert A and Expert B\nto the agent, and the resulting maxmin policy guides movement in a wider range extending to the\nright of the field, as shown in Figure 6 (bottom). This time, our maxmin policy also learns from\nExpert B that the yellow block is not harmful. 
The experiment demonstrates that our maxmin method works well\nwith complex reinforcement learning tasks where only approximate MDP solvers are available.\n\n6 Discussion\n\nIn this paper, we provided a theoretical treatment of the problem of reinforcement learning in the\npresence of misspecifications of the agent\u2019s reward function, by leveraging data provided by experts.\nThe posed optimization can be solved exactly in polynomial time using the ellipsoid method,\nbut a more practical solution is provided by an algorithm that takes a follow-the-perturbed-leader\napproach. Our experiments illustrate that this approach can successfully learn robust policies\nfrom imperfect expert data, in both discrete and continuous environments. It will be interesting to see\nwhether our maxmin formulation can be combined with other methods in RL, such as hierarchical\nlearning, to produce robust solutions in larger problems.\n\n7 Acknowledgement\n\nDoina Precup and Jessie Huang gratefully acknowledge funding from Open Philanthropy Fund\nand NSERC, which made this research possible. Yang Cai and Fa Wu thank NSERC for its\nsupport through the Discovery grant RGPIN2015-06127 and FRQNT for its support through the\ngrant 2017-NC-198956.\n\nReferences\n[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning.\nIn Proceedings of the twenty-first international conference on Machine learning, page 1. ACM,\n2004.\n\n[2] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In\nAdvances in Neural Information Processing Systems, pages 1813\u20131822, 2017.\n\n[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man\u00e9.\nConcrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.\n\n[4] Aharon Ben-Tal, Elad Hazan, Tomer Koren, and Shie Mannor. Oracle-based robust optimization\nvia online learning. 
Operations Research, 63(3):628\u2013638, 2015.\n\n[5] Stephen Boyd and Lieven Vandenberghe. Localization and cutting-plane methods. From\nStanford EE 364b lecture notes, 2007.\n\n[6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,\nand Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[7] Yang Cai, Constantinos Daskalakis, and S. Matthew Weinberg. Reducing Revenue to Welfare\nMaximization: Approximation Algorithms and other Generalizations. In the 24th Annual\nACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.\n\n[8] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge\nUniversity Press, 2006.\n\n[9] Martin Gr\u00f6tschel, L\u00e1szl\u00f3 Lov\u00e1sz, and Alexander Schrijver. The Ellipsoid Method and its\nConsequences in Combinatorial Optimization. Combinatorica, 1(2):169\u2013197, 1981.\n\n[10] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse\nreward design. In Advances in Neural Information Processing Systems, pages 6768\u20136777, 2017.\n\n[11] Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader.\nJournal of Machine Learning Research, 6(Apr):639\u2013660, 2005.\n\n[12] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research,\n30(2):257\u2013280, 2005.\n\n[13] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal\nof Computer and System Sciences, 71(3):291\u2013307, 2005.\n\n[14] Richard M. Karp and Christos H. Papadimitriou. On linear characterizations of combinatorial\noptimization problems. SIAM J. Comput., 11(4):620\u2013632, 1982.\n\n[15] Leonid G. Khachiyan. A Polynomial Algorithm in Linear Programming. Soviet Mathematics\nDoklady, 20(1):191\u2013194, 1979.\n\n[16] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision\nprocesses. 
In Advances in Neural Information Processing Systems, pages 701\u2013709, 2013.\n\n[17] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335\u2013359,\n2005.\n\n[18] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML,\npages 663\u2013670, 2000.\n\n[19] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain\ntransition matrices. Operations Research, 53(5):780\u2013798, 2005.\n\n[20] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear\nprogramming. In Proceedings of the 25th international conference on Machine learning, pages\n1032\u20131039. ACM, 2008.\n\n[21] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In\nAdvances in Neural Information Processing Systems, pages 1449\u20131456, 2008.", "award": [], "sourceid": 5456, "authors": [{"given_name": "Jessie", "family_name": "Huang", "institution": "McGill University"}, {"given_name": "Fa", "family_name": "Wu", "institution": "McGill"}, {"given_name": "Doina", "family_name": "Precup", "institution": "McGill University / DeepMind Montreal"}, {"given_name": "Yang", "family_name": "Cai", "institution": "McGill University"}]}