{"title": "Bootstrapping Apprenticeship Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 289, "page_last": 297, "abstract": "We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert's policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i) the expert is (near-)optimal, and (ii) the dynamics of the system are known. Empirical results on gridworld and car-racing problems show that our approach is able to learn good policies from a small number of demonstrations.", "full_text": "Bootstrapping Apprenticeship Learning

Abdeslam Boularias
Department of Empirical Inference
Max Planck Institute for Biological Cybernetics
72076 Tübingen, Germany
abdeslam.boularias@tuebingen.mpg.de

Brahim Chaib-Draa
Department of Computer Science
Laval University
Quebec G1V 0A6, Canada
chaib@damas.ift.ulaval.ca

Abstract

We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. 
Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert's policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that: (i) the expert is (near-)optimal, and (ii) the dynamics of the system are known. Empirical results on gridworld and car-racing problems show that our approach is able to learn good policies from a small number of demonstrations.

1 Introduction

Modern robots are designed to perform complicated planning and control tasks, such as manipulating objects, navigating in outdoor environments, and driving in urban settings. Unfortunately, manually programming these tasks is almost infeasible in practice due to the large number of states involved. Markov Decision Processes (MDPs) provide an efficient tool for handling such tasks with a little help from an expert. The expert's help consists in simply specifying a reward function. However, in many practical problems, even specifying a reward function is not easy. In fact, it is often easier to demonstrate examples of a desired behavior than to define a reward function (Ng & Russell, 2000). Learning policies from demonstration, a.k.a. apprenticeship learning, is a technique that has been widely used in robotics. An efficient approach to apprenticeship learning, known as Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004), consists in recovering a reward function under which the policy demonstrated by an expert is near-optimal, rather than directly mimicking the expert's actions. The learned reward is then used for finding an optimal policy. Consequently, the expert's actions can be predicted in states that have not been encountered during the demonstration. 
Unfortunately, as already pointed out by Abbeel & Ng (2004), recovering a reward function is an ill-posed problem. In fact, the expert's policy can be optimal under an infinite number of reward functions. Most of the work on apprenticeship learning via IRL has focused on solving this particular problem by using different types of regularization and loss functions (Ratliff et al., 2006; Ramachandran & Amir, 2007; Syed & Schapire, 2008; Syed et al., 2008).

In this paper, we focus on another important problem occurring in IRL. IRL-based algorithms rely on the assumption that the reward function is a linear combination of state-action features. Therefore, the value function of any policy is a linear combination of the expected discounted frequencies (counts) of the state-action features. In particular, the value function of the expert's policy is approximated by a linear combination of the empirical averages of the features, estimated from the demonstration (the trajectories). In practice, this method works efficiently only if the number of examples is sufficiently large to cover all the states, or the dynamics of the system is nearly deterministic. For tasks involving stochastic dynamics and a limited number of available examples, we propose an alternative method for approximating the expected frequencies of the features under the expert's policy. Our approach takes advantage of the fact that the expert's partially demonstrated policy is near-optimal, and generalizes the expert's policy beyond the states that appeared in the demonstration. 
We show that this technique can be efficiently used to improve the performance of two known IRL algorithms, namely Maximum Margin Planning (MMP) (Ratliff et al., 2006) and Linear Programming Apprenticeship Learning (LPAL) (Syed et al., 2008).

2 Preliminaries

Formally, a finite-state Markov Decision Process (MDP) is a tuple (S, A, {T^a}, R, α, γ), where: S is a set of states, A is a set of actions, T^a is a transition matrix defined as ∀s, s' ∈ S, a ∈ A : T^a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a), R is a reward function (R(s, a) is the reward associated with the execution of action a in state s), α is the initial state distribution, and γ is a discount factor. We denote by MDP\R a Markov Decision Process without a reward function, i.e. a tuple (S, A, {T^a}, α, γ). We assume that the reward function R is given by a linear combination of k feature vectors f_i with weights w_i:

  ∀s ∈ S, ∀a ∈ A : R(s, a) = Σ_{i=0}^{k−1} w_i f_i(s, a)

A deterministic policy π is a function that returns an action π(s) for each state s. A stochastic policy π is a probability distribution on the action to be executed in each state, defined as π(s, a) = Pr(a_t = a | s_t = s). The value V(π) of a policy π is the expected discounted sum of rewards received when π is followed, i.e. V(π) = E[Σ_{t=0}^∞ γ^t R(s_t, a_t) | α, π, T]. An optimal policy π is one satisfying π = arg max_π V(π). The occupancy µ_π of a policy π is the discounted state-action visit distribution, defined as µ_π(s, a) = E[Σ_{t=0}^∞ γ^t δ_{s_t,s} δ_{a_t,a} | α, π, T], where δ is the Kronecker delta. We also use µ_π(s) to denote Σ_{a∈A} µ_π(s, a). The following linear constraints, known as the Bellman-flow constraints, are necessary and sufficient for defining the occupancy measure of a policy:

  µ_π(s) = α(s) + γ Σ_{s'∈S} Σ_{a∈A} µ_π(s', a) T^a(s', s),   Σ_{a∈A} µ_π(s, a) = µ_π(s),   µ_π(s, a) ≥ 0   (1)

Since a policy π is well defined by its occupancy measure µ_π, one can use π and µ_π interchangeably to denote a policy. The set of feasible occupancy measures is denoted by G. The frequency of a feature f_i under a policy π is given by v_{i,π} = F(i, ·) µ_π, where F is a k × |S||A| feature matrix such that F(i, (s, a)) = f_i(s, a). Using this definition, the value of a policy π can be written as a linear function of the frequencies: V(π) = w^T F µ_π = w^T v_π, where v_π is the vector of the v_{i,π}. Therefore, the value of a policy is completely determined by the frequencies (or counts) of the features f_i.

3 Apprenticeship Learning

3.1 Overview

The aim of apprenticeship learning is to find a policy π that is at least as good as a policy π_E demonstrated by an expert, i.e. V(π) ≥ V(π_E). The value functions of π and π_E cannot be directly compared unless a reward function is provided. To solve this problem, Ng & Russell (2000) proposed to first learn a reward function, assuming that the expert is optimal, and then use it to recover the expert's complete policy. However, the problem of learning a reward function given an optimal policy is ill-posed (Abbeel & Ng, 2004). In fact, a large class of reward functions, including all constant functions for instance, may lead to the same optimal policy. 
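The occupancy-measure machinery of Section 2 is easy to make concrete. Below is a minimal numpy sketch (the toy MDP, the variable names, and all numbers are ours, purely illustrative) that solves the Bellman-flow equations (1) for µ_π and reads off the feature counts v_π = F µ_π:

```python
import numpy as np

def occupancy(T, policy, alpha, gamma):
    # T: (A, S, S) with T[a, s, s2] = Pr(s2 | s, a); policy: (S, A); alpha: (S,)
    # State-to-state transition matrix under pi: P[s, s2] = sum_a pi(s, a) T[a, s, s2]
    P = np.einsum('sa,ast->st', policy, T)
    S = alpha.shape[0]
    # Bellman-flow equations (1): mu(s) = alpha(s) + gamma * sum_{s2} mu(s2) P[s2, s]
    mu_s = np.linalg.solve(np.eye(S) - gamma * P.T, alpha)
    return mu_s[:, None] * policy   # mu(s, a) = mu(s) * pi(s, a)

# Toy 2-state, 2-action MDP (illustrative numbers only)
T = np.array([[[0.9, 0.1], [0.1, 0.9]],    # action 0
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1
policy = np.array([[1.0, 0.0], [0.0, 1.0]])
alpha = np.array([1.0, 0.0])
gamma = 0.95

mu = occupancy(T, policy, alpha, gamma)
F = np.eye(4)                  # one indicator feature per (s, a) pair
v = F @ mu.ravel()             # feature counts v_i = F(i, .) mu
# The occupancy sums to 1/(1 - gamma), as the flow constraints imply.
print(abs(mu.sum() - 1.0 / (1.0 - gamma)) < 1e-9)   # True
```

Because µ_π is linear in the flow equations, a single linear solve replaces any sampling; this is exactly the exact computation that requires the full policy π_E.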
To overcome this problem, Abbeel & Ng (2004) did not consider recovering a reward function; instead, their algorithm returns a policy π with a bounded loss in the value function, i.e. ‖V(π) − V(π_E)‖ ≤ ε, where the value is calculated by using the worst-case reward function. This property is derived from the fact that when the feature frequencies of two policies match, the cumulative rewards of the two policies match as well, assuming that the reward is a linear function of these features. In the next two subsections, we briefly describe two algorithms for apprenticeship learning via IRL. The first one, known as Maximum Margin Planning (MMP) (Ratliff et al., 2006), is a robust algorithm based on learning a reward function under which the expert's demonstrated actions are optimal. The second one, known as Linear Programming Apprenticeship Learning (LPAL) (Syed et al., 2008), is a fast algorithm that directly returns a policy with a bounded loss in the value.

3.2 Maximum Margin Planning

Maximum Margin Planning (MMP) returns a vector of reward weights w such that the value of the expert's policy, w^T F µ_{π_E}, is higher than the value of an alternative policy, w^T F µ_π, by a margin that scales with the number of expert actions that differ from the actions of the alternative policy. This criterion is explicitly specified in the cost function minimized by the algorithm:

  c_q(w) = ( max_{µ∈G} (w^T F + l)µ − w^T F µ_{π_E} )^q + (λ/2) ‖w‖²   (2)

where q ∈ {1, 2} defines the slack penalization, λ is a regularization parameter, and l is a deviation cost vector that can be defined as l(s, a) = 1 − π_E(s, a). A policy maximizing the cost-augmented reward vector (w^T F + l) is almost completely different from π_E, since an additional reward l(s, a) is given to actions that differ from those of the expert. This algorithm minimizes the difference between the value divergence w^T F µ_{π_E} − w^T F µ and the policy divergence lµ.

The cost function c_q is convex but nondifferentiable. Ratliff et al. (2006) showed that c_q can be minimized by using a subgradient method. For a given reward weight vector w, a subgradient g_w^q is given by:

  g_w^q = q ( (w^T F + l)µ⁺ − w^T F µ_{π_E} )^{q−1} F ∆_w µ_{π_E} + λw   (3)

where µ⁺ = arg max_{µ∈G} (w^T F + l)µ and ∆_w µ_{π_E} = µ⁺ − µ_{π_E}.

3.3 Linear Programming Apprenticeship Learning

Linear Programming Apprenticeship Learning (LPAL) is based on the following observation: if the reward weights are positive and sum to 1, then V(π) ≥ V(π_E) + min_i [v_{i,π} − v_{i,π_E}] for any policy π. LPAL consists in finding a policy that maximizes the margin min_i [v_{i,π} − v_{i,π_E}]. The maximal margin is found by solving the following linear program:

  max_{v, µ_π}  v
  subject to  ∀i ∈ {0, …, k−1} :  v ≤ Σ_{s∈S} Σ_{a∈A} µ_π(s, a) f_i(s, a) − Σ_{s∈S} Σ_{a∈A} µ_{π_E}(s, a) f_i(s, a)   (4)
              µ_π(s) = α(s) + γ Σ_{s'∈S} Σ_{a∈A} µ_π(s', a) T^a(s', s),
              Σ_{a∈A} µ_π(s, a) = µ_π(s),   µ_π(s, a) ≥ 0

where the two sums in the margin constraint are v_{i,π} and v_{i,π_E}, respectively. The last three constraints in this linear program correspond to the Bellman-flow constraints (Equation (1)) defining G, the feasible set of µ_π. The learned policy π is given by:

  π(s, a) = µ_π(s, a) / Σ_{a'∈A} µ_π(s, a')

3.4 Approximating feature frequencies

Notice that both MMP and LPAL require the knowledge of the frequencies v_{i,π_E} def= F(i, ·) µ_{π_E}. These frequencies can be analytically calculated (using the Bellman-flow constraints) only if π_E is completely specified. Given a sequence of M demonstrated trajectories t_m = (s_1^m, a_1^m, …, s_H^m, a_H^m), the frequencies v_{i,π_E} are estimated as:

  v̂_{i,π_E} = (1/M) Σ_{m=1}^{M} Σ_{t=1}^{H} γ^t f_i(s_t^m, a_t^m)   (5)

There are nevertheless several problems with this approximation. First, the estimated frequencies v̂_{i,π_E} can be very different from the true ones when the demonstration trajectories are scarce. Second, the frequencies v̂_{i,π_E} are estimated for a finite horizon H, whereas the frequencies v_{i,π} used in the objective functions (Equations (2) and (4)) are calculated for an infinite horizon (Equation (1)). In practice, these two values are often too different to be compared as done in these cost functions. 
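With q = 1, minimizing the cost above amounts to repeatedly planning in the cost-augmented MDP and stepping along a subgradient (the subgradient formula is derived in the text that follows). The sketch below is our own illustration, not the authors' implementation: it uses value iteration as the inner planner on a made-up two-state MDP, with names of our choosing:

```python
import numpy as np

def value_iter_policy(T, r_sa, gamma, iters=200):
    # Greedy deterministic policy (one-hot, S x A) for reward r_sa, by value iteration.
    V = np.zeros(T.shape[1])
    for _ in range(iters):
        Q = r_sa + gamma * np.einsum('ast,t->sa', T, V)
        V = Q.max(axis=1)
    return np.eye(T.shape[0])[Q.argmax(axis=1)]

def occupancy(T, policy, alpha, gamma):
    # Occupancy mu_pi obtained by solving the Bellman-flow equations (1).
    P = np.einsum('sa,ast->st', policy, T)
    mu_s = np.linalg.solve(np.eye(len(alpha)) - gamma * P.T, alpha)
    return mu_s[:, None] * policy

def mmp_step(w, F, l_sa, mu_E, T, alpha, gamma, lam=0.1, lr=0.05):
    # One subgradient step on c_1(w): g = F (mu_plus - mu_E) + lam * w,
    # where mu_plus maximizes the cost-augmented reward (w^T F + l).
    S, A = l_sa.shape
    r_aug = (w @ F).reshape(S, A) + l_sa
    mu_plus = occupancy(T, value_iter_policy(T, r_aug, gamma), alpha, gamma)
    return w - lr * (F @ (mu_plus.ravel() - mu_E.ravel()) + lam * w)

# Deterministic 2-state, 2-action toy problem (ours): action 0 stays, action 1 swaps.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
alpha = np.array([0.5, 0.5]); gamma = 0.9
F = np.eye(4)                                # one indicator feature per (s, a)
pi_E = np.array([[1.0, 0.0], [1.0, 0.0]])    # "expert" always takes action 0
mu_E = occupancy(T, pi_E, alpha, gamma)
l_sa = 1.0 - pi_E                            # deviation cost l(s, a) = 1 - pi_E(s, a)

w = np.zeros(4)
for _ in range(100):
    w = mmp_step(w, F, l_sa, mu_E, T, alpha, gamma)
# The greedy policy under the learned reward w^T F reproduces the expert.
pi_hat = value_iter_policy(T, (w @ F).reshape(2, 2), gamma)
print((pi_hat == pi_E).all())   # True
```

With indicator features, the subgradient simply raises the reward of the demonstrated state-action pairs and lowers the reward of the cost-augmented planner's choices until the two occupancies agree.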
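For concreteness, the Monte Carlo estimator (5) can be sketched as follows, on a toy chain MDP of our own design (none of the environment details come from the paper). It also illustrates the scarcity issue: the per-feature split of the estimate from a handful of trajectories is noisy, while its total mass is fixed by the discounting alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_feature_counts(trajs, feat, k, gamma):
    # Eq. (5): v_hat_i = (1/M) sum_m sum_{t=1}^H gamma^t f_i(s_t^m, a_t^m)
    v = np.zeros(k)
    for traj in trajs:
        for t, (s, a) in enumerate(traj, start=1):
            v += gamma ** t * feat(s, a)
    return v / len(trajs)

# Toy 3-state chain (ours): action 0 moves right with prob. 0.8, action 1 stays.
S, gamma, H = 3, 0.9, 30

def step(s, a):
    if a == 0 and rng.random() < 0.8:
        return min(s + 1, S - 1)
    return s

def feat(s, a):
    f = np.zeros(S)      # one indicator feature per state
    f[s] = 1.0
    return f

def rollout(policy, H):
    s, traj = 0, []
    for _ in range(H):
        traj.append((s, policy(s)))
        s = step(s, traj[-1][1])
    return traj

expert = lambda s: 0     # the "expert" always moves right
v_few  = mc_feature_counts([rollout(expert, H) for _ in range(5)], feat, S, gamma)
v_many = mc_feature_counts([rollout(expert, H) for _ in range(2000)], feat, S, gamma)
# Exactly one state feature fires per step, so every estimate has the same total
# mass sum_{t=1}^H gamma^t, no matter how noisy the per-state split is.
print(abs(v_few.sum() - sum(gamma**t for t in range(1, H + 1))) < 1e-9)   # True
```

Comparing v_few against v_many shows the estimation error that the bootstrapping approach of this paper is designed to remove.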
Finally, the frequencies v_{i,π_E} are a function of both the policy and the transition probabilities, yet the empirical estimate of v_{i,π_E} does not take advantage of the known transition probabilities.

4 Reward loss in Maximum Margin Planning

To show the effect of the error in the estimated feature frequencies on the quality of the learned rewards, we present an analysis of the distance between the vector of reward weights ŵ returned by MMP with estimated frequencies v̂_{π_E} = F µ̂_{π_E}, calculated from the examples by using Equation (5), and the vector w_E returned by MMP with accurate frequencies v_{π_E} = F µ_{π_E}, calculated by using Equations (1) with the full policy π_E. We adopt the following notations: ∆v_π = v̂_{π_E} − v_{π_E}, ∆w = ŵ − w_E, and V_l(w) = max_{µ∈G} (w^T F + l)µ, and we consider q = 1. The following proposition shows how the reward error ∆w is related to the frequency error ∆v_π. Because the cost function of MMP is piecewise defined, one cannot find a closed-form relation between ∆w and ∆v_π. However, we show that for any ŵ ∈ R^k there is a monotonically decreasing function f such that for any ε ∈ R⁺, if ‖∆v_π‖₂ < f(ε) then ‖∆w‖₂ ≤ ε.

Figure 1: Reward loss in MMP with approximate frequencies v̂_{π_E}. We indicate by v_{π_E} (resp. v̂_{π_E}) the linear function defined by the vector v_{π_E} (resp. v̂_{π_E}).

Proposition 1 Let ε ∈ R⁺. If, for every w ∈ R^k such that ‖w − ŵ‖₂ = ε, the following condition is verified:

  ‖∆v_π‖₂ < [ V_l(w) − V_l(ŵ) + (ŵ − w)^T v̂_{π_E} + (λ/2)(‖w‖² − ‖ŵ‖²) ] / ε

then ‖∆w‖₂ ≤ ε.

Proof The condition stated in the proposition implies:

  ‖ŵ − w‖₂ ‖∆v_π‖₂ < V_l(w) − V_l(ŵ) + (ŵ − w)^T v̂_{π_E} + (λ/2)(‖w‖² − ‖ŵ‖²)
  ⇒ (ŵ − w)^T ∆v_π < V_l(w) − V_l(ŵ) + (ŵ − w)^T v̂_{π_E} + (λ/2)(‖w‖² − ‖ŵ‖²)   (Hölder)
  ⇒ V_l(ŵ) − ( ŵ^T v_{π_E} − (λ/2)‖ŵ‖² ) < V_l(w) − ( w^T v_{π_E} − (λ/2)‖w‖² )

In other terms, the point (ŵ, ŵ^T v_{π_E} − (λ/2)‖ŵ‖²) is closer to the surface V_l than any other point (w, w^T v_{π_E} − (λ/2)‖w‖²), where w is a point on the sphere centered at ŵ with radius ε. Since the function V_l is convex and (w_E, w_E^T v_{π_E} − (λ/2)‖w_E‖²) is by definition the closest point to the surface V_l, w_E must lie inside the ball centered at ŵ with radius ε. Therefore ‖w_E − ŵ‖₂ ≤ ε, and thus ‖∆w‖₂ ≤ ε. □

Consequently, the reward loss ‖∆w‖₂ approaches zero as the error of the estimated feature frequencies ‖∆v_π‖₂ approaches zero. A simpler bound can easily be derived given admissible heuristics of V_l.

Corollary Let V̲_l and V̄_l be respectively a lower and an upper bound on V_l. Then Proposition 1 holds if V_l(w) − V_l(ŵ) is replaced by V̲_l(w) − V̄_l(ŵ).

Figure 1 illustrates the divergence from the optimal reward weight w_E when approximate frequencies are used. The error is not a continuous function of ∆v_π when the cost function is not regularized, because the vector returned by MMP is always a fringe point. Informally, the error is proportional to the maximum subgradient of the function V_l − v_{π_E} at the fringe point w_E.

5 Bootstrapping Maximum Margin Planning

The feature frequency error ∆v_π can be significantly reduced by using the known transition function and solving the flow equations (1) to calculate v̂_{π_E}, instead of using the Monte Carlo estimator (Equation (5)). However, this cannot be done unless the complete expert policy π_E is provided. Assuming that the expert's policy π_E is optimal and deterministic, the value w^T F µ_{π_E} in Equation (2) can be replaced by max_{µ∈G_{π_E}} w^T F µ, the value of the optimal policy, according to the current reward weight w, that selects the same actions as the expert in all the states that occurred in the demonstration. 
The cost function of the bootstrapped Maximum Margin Planning becomes:

  c_q(w) = ( max_{µ₁∈G} (w^T F + l)µ₁ − max_{µ₂∈G_{π_E}} w^T F µ₂ )^q + (λ/2) ‖w‖²   (6)

where G_{π_E} is the set of vectors µ_π subject to the following modified Bellman-flow constraints:

  µ_π(s) = α(s) + γ Σ_{s'∈S_e} µ_π(s') Σ_{a∈A} π_E(s', a) T^a(s', s) + γ Σ_{s'∈S\S_e} Σ_{a∈A} µ_π(s', a) T^a(s', s),
  Σ_{a∈A} µ_π(s, a) = µ_π(s),   µ_π(s, a) ≥ 0   (7)

S_e is the set of states encountered in the demonstrations, where the expert's policy is known. Unfortunately, the new cost function (Equation (6)) is not necessarily convex. In fact, it corresponds to a margin between two convex functions: the value of the bootstrapped expert policy, max_{µ∈G_{π_E}} w^T F µ, and the value of the best alternative policy, max_{µ∈G} (w^T F + l)µ. Yet, a locally optimal solution of this modified cost function can be found by using the same subgradient as in Equation (3) and replacing µ_{π_E} by arg max_{µ∈G_{π_E}} w^T F µ. In practice, as we will show in the experimental analysis, the solution returned by the bootstrapped MMP outperforms the solution of MMP where the expert's frequencies are calculated without taking the known transition probabilities into account. This improvement is particularly pronounced in highly stochastic environments. The computational cost of minimizing the modified cost function is twice that of MMP, since two optimal policies must be found at each iteration.

In the remainder of this section, we provide a theoretical analysis of the cost function given by Equation (6). 
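The inner maximization over G_{π_E} only requires a planner in which the action choice is frozen to the demonstrated action in every state of S_e, per the modified flow constraints (7). A numpy sketch under our own toy dynamics (names and numbers are illustrative, not the authors'):

```python
import numpy as np

def best_value_in_G_piE(T, r_sa, gamma, expert_action, iters=300):
    # Value iteration that maximizes over actions only outside S_e and follows
    # the demonstrated action pi_E(s) inside S_e, as in the constraints (7).
    # expert_action: dict {state: demonstrated action}; its keys are S_e.
    V = np.zeros(T.shape[1])
    for _ in range(iters):
        Q = r_sa + gamma * np.einsum('ast,t->sa', T, V)
        V = Q.max(axis=1)
        for s, a in expert_action.items():   # bootstrapped states: no free choice
            V[s] = Q[s, a]
    return V

# Toy 3-state chain (ours): action 0 moves right, action 1 stays; reward at the goal.
T = np.zeros((2, 3, 3))
T[0] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]   # right
T[1] = np.eye(3)                                             # stay
r = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
gamma = 0.9

v_free = best_value_in_G_piE(T, r, gamma, expert_action={})
v_boot = best_value_in_G_piE(T, r, gamma, expert_action={0: 1})  # expert "stays" in s0
# Freezing state 0 to the demonstrated action can only lower the attainable value.
print(v_boot[0] <= v_free[0])   # True
```

Since G_{π_E} ⊆ G, the bootstrapped value is always a lower bound on the unconstrained optimum, which is what makes the margin in (6) meaningful.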
For the sake of simplicity, we consider q = 1 and λ = 0.

Proposition 2 The cost function defined by Equation (6) has at most |A|^{|S|−|S_e|} different local minima.

Proof If q = 1 and λ = 0, then the cost c_q(w) corresponds to a distance between the convex and piecewise linear functions max_{µ∈G} (w^T F + l)µ and max_{µ∈G_{π_E}} w^T F µ. Therefore, for any vector µ' ∈ G_{π_E}, the function c_q is monotone on the region of w where µ' is optimal, i.e. where w^T F µ' = max_{µ∈G_{π_E}} w^T F µ. Consequently, the number of local minima of the function c_q is at most equal to the number of optimal vectors µ in G_{π_E}, which is upper bounded by the number of deterministic policies defined on S\S_e, i.e. by |A|^{|S|−|S_e|}. □

Consequently, the number of different local minima of the function c_q decreases as the number of states covered by the demonstration increases. Ultimately, the function c_q becomes convex when the demonstration covers all the possible states.

Theorem 1 If there exists a reward weight vector w* ∈ R^k such that the expert's policy π_E is the only optimal policy under w*, i.e. arg max_{µ∈G} w*^T F µ = {µ_{π_E}}, then there exists α > 0 such that: (i) the expert's policy π_E is the only optimal policy under αw*, and (ii) c_q(αw*) is a local minimum of the function c_q defined in Equation (6).

Proof The set of subgradients of the function c_q at a point w ∈ R^k, denoted by ∇_w c_q(w), corresponds to the vectors F µ' − F µ'' with µ' ∈ arg max_{µ∈G} (w^T F + l)µ and µ'' ∈ arg max_{µ∈G_{π_E}} w^T F µ. For c_q(w) to be a local minimum, it suffices to ensure that 0 ∈ ∇_w c_q(w), i.e. that there exist µ' ∈ arg max_{µ∈G} (w^T F + l)µ and µ'' ∈ arg max_{µ∈G_{π_E}} w^T F µ such that F µ' = F µ''. Let w* ∈ R^k be a reward weight vector such that π_E is the only optimal policy, and let ε = w*^T F µ_{π_E} − w*^T F µ', where µ' ∈ arg max_{µ∈G−{µ_{π_E}}} w*^T F µ. Then αw*^T F µ_{π_E} − αw*^T F µ' = 2|S_e| / (1−γ), where α = 2|S_e| / (ε(1−γ)). Notice that by multiplying w* by α > 0, π_E remains the only optimal policy, i.e. arg max_{µ∈G} αw*^T F µ = {µ_{π_E}}, and µ' ∈ arg max_{µ∈G−{µ_{π_E}}} αw*^T F µ. Therefore, it suffices to show that µ_{π_E} ∈ arg max_{µ∈G} (αw*^T F + l)µ. Indeed,

  max_{µ∈G−{µ_{π_E}}} (αw*^T F + l)µ ≤ max_{µ∈G−{µ_{π_E}}} αw*^T F µ + max_{µ∈G−{µ_{π_E}}} lµ
                                     ≤ ( αw*^T F µ_{π_E} − 2|S_e|/(1−γ) ) + |S_e|/(1−γ)
                                     ≤ αw*^T F µ_{π_E} − |S_e|/(1−γ),

and since (αw*^T F + l)µ_{π_E} ≥ αw*^T F µ_{π_E}, it follows that µ_{π_E} ∈ arg max_{µ∈G} (αw*^T F + l)µ. □

6 Bootstrapping Linear Programming Apprenticeship Learning

As with MMP, the feature frequencies in LPAL can be analytically calculated only when a complete expert policy π_E is provided. Alternatively, the same error bound V(π) ≥ V(π_E) + v can be guaranteed by setting v = min_{i=0,…,k−1} min_{π'∈Π_E} [v_{i,π} − v_{i,π'}], where Π_E denotes the set of all policies that select the same actions as the expert in all the states that occurred in the demonstration, assuming π_E is deterministic (in LPAL, π_E is not necessarily an optimal policy). Instead of enumerating all the policies of the set Π_E in the constraints, note that v = min_{i=0,…,k−1} [v_{i,π} − v_i^E], where v_i^E def= max_{π'∈Π_E} v_{i,π'} for each feature i. Therefore, LPAL can be reformulated as maximizing the margin min_{i=0,…,k−1} [v_{i,π} − v_i^E]. The maximal margin is found by solving the following linear program:

  max_{v, µ_π}  v
  subject to  ∀i ∈ {0, …, k−1} :  v ≤ Σ_{s∈S} Σ_{a∈A} µ_π(s, a) f_i(s, a) − v_i^E
              µ_π(s) = α(s) + γ Σ_{s'∈S} Σ_{a∈A} µ_π(s', a) T^a(s', s),
              Σ_{a∈A} µ_π(s, a) = µ_π(s),   µ_π(s, a) ≥ 0

where the first sum in the margin constraint is v_{i,π}, and the values v_i^E are found by solving k separate optimization problems (k is the number of features). For each feature i, v_i^E is the value of the optimal policy in the set Π_E under the reward weights w defined as w_i = 1 and w_j = 0 for all j ≠ i.

7 Experimental Results

To validate our approach, we experimented on two simulated navigation problems: a gridworld and two racetrack domains, taken from (Boularias & Chaib-draa, 2010). While these are not meant to be challenging tasks, they allow us to compare our approach to other methods of apprenticeship learning, namely MMP and LPAL with Monte Carlo estimation, and a simple classification algorithm where the action in a given state is selected by performing a majority vote over the k-nearest-neighbor states where the expert's action is known. For each state, the distance k is gradually increased until at least one known state is encountered. The distance between two states corresponds to the shortest path between them with a positive transition probability.

7.1 Gridworld

We consider 16 × 16 and 24 × 24 gridworlds. The state corresponds to the location of the agent on the grid. The agent has four actions for moving in one of the four directions of the compass. The actions succeed with probability 0.9. 
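The dynamics just described can be encoded directly as a transition tensor. A short sketch (our code; the paper does not specify how the failing 0.1 probability mass is distributed, so we assume the agent stays in place):

```python
import numpy as np

def gridworld_T(n, p_success=0.9):
    # Transition tensor (4, n*n, n*n) for an n x n grid with actions N, S, E, W.
    # The intended move succeeds with probability p_success; with the remaining
    # mass the agent stays put (our assumption -- the paper leaves this open).
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]
    T = np.zeros((4, n * n, n * n))
    for s in range(n * n):
        r, c = divmod(s, n)
        for a, (dr, dc) in enumerate(moves):
            r2 = min(max(r + dr, 0), n - 1)   # moves into a wall leave the
            c2 = min(max(c + dc, 0), n - 1)   # agent where it is
            T[a, s, r2 * n + c2] += p_success
            T[a, s, s] += 1.0 - p_success
    return T

T = gridworld_T(16)
# Each row of each action's transition matrix is a probability distribution.
print(np.allclose(T.sum(axis=2), 1.0))   # True
```

The same tensor can be fed to any of the occupancy or planning routines sketched earlier in the text.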
The gridworld is divided into non-overlapping regions, and the reward varies depending on the region in which the agent is located. For each region i, there is a feature f_i, where f_i(s) indicates whether state s is in region i. The expert's policy π_E corresponds to the optimal deterministic policy found by value iteration. In all our experiments on gridworlds, we used only 10 demonstration trajectories, which is a significantly small number compared to other methods (Neu & Szepesvári (2007), for example). The duration of the trajectories is 50 time-steps.

Size     Features  Expert   k-NN     MMP + MC  MMP + Bootstrap  LPAL + MC  LPAL + Bootstrap
16 × 16  16        0.4672   0.4635   0.0000    0.4678           0.0380     0.1572
16 × 16  64        0.5281   0.5198   0.0000    0.5252           0.0255     0.4351
16 × 16  256       0.3988   0.4062   0.0537    0.3828           0.0555     0.1706
24 × 24  64        0.5210   0.6334   0.0000    0.5217           0.0149     0.2767
24 × 24  144       0.5916   0.5876   0.0122    0.5252           0.0400     0.4432
24 × 24  576       0.3102   0.2814   0.0974    0.0514           0.0439     0.0349

Table 1: Gridworld average reward results

Table 1 shows the average reward per step of the learned policy, averaged over 10³ independent trials of the same duration as the demonstration trajectories. Our first observation is that bootstrapped MMP learned policies just as good as the expert's policy, while both MMP and LPAL using the Monte Carlo (MC) estimator remarkably failed to collect any reward. This is due to the fact that we used a very small number of demonstrations (10 trajectories of 50 time-steps each) compared to the size of these problems. Note that this problem is not specific to MMP or LPAL: any other algorithm using the same approximation method would produce similar results. The second observation is that the values of the policies learned by bootstrapped LPAL lie between the values of LPAL with Monte Carlo and the optimal ones. 
In fact, the policy learned by the bootstrapped LPAL is one that minimizes the gap between the expected frequency of each feature under this policy and the maximal one among all the policies that resemble the expert's policy. Therefore, the learned policy maximizes the frequency of a feature that is not necessarily a good one (i.e., one with a high reward weight). We also notice that the performance of all the tested algorithms was low when 576 features were used. In this case, every feature takes a nonzero weight in one state only; therefore, the demonstrations did not provide enough information about the rewards of the states that were not visited by the expert. Finally, we remark that k-NN performed as well as the expert in this experiment. In fact, since there are no obstacles on the grid, neighboring states often have similar optimal actions.

7.2 Racetrack

We implemented a simplified car race simulator; a detailed description of the corresponding racetracks is provided in (Boularias & Chaib-draa, 2010). The states correspond to the position of the car on the racetrack and its velocity. For racetrack (1), the car always starts from the same initial position, and the duration of each demonstration trajectory is 20 time-steps. For racetrack (2), the car starts at a random position, and the length of each trajectory is 40 time-steps. A high reward is given for reaching the finish line, a small cost is incurred by each movement, and a high cost is associated with driving off-road (or hitting an obstacle). Figure 2 (a-f) shows the average reward per step of the learned policies, the average proportion of off-road steps, and the average number of steps before reaching the finish line, as a function of the number of trajectories in the demonstration. We first notice that k-NN performed poorly; this is principally caused by the effect of driving off-road on both the cumulative reward and the velocity of the car. 
In this context, neighboring states do not necessarily share the same optimal action. Contrary to the gridworld experiments, MMP with Monte Carlo achieved good performance on racetrack (1). In fact, by fixing the initial state, the demonstration covers most of the reachable states, and the feature frequencies are accurately estimated from the demonstration. On racetrack (2), however, MMP with MC was unable to learn a good policy because all the states were reachable from the initial distribution. Similarly, LPAL with both MC and bootstrapping failed to achieve good results on racetracks (1) and (2). This is due to the fact that LPAL tries to maximize the frequencies of features that are not necessarily associated with a high reward, such as hitting obstacles. Finally, we notice the nearly optimal performance of the bootstrapped MMP on both racetracks (1) and (2).

Figure 2: Racetrack results. (a) Average reward in racetrack 1; (b) average number of steps in racetrack 1; (c) average number of off-road steps, racetrack 1; (d) average reward in racetrack 2; (e) average number of steps in racetrack 2; (f) average number of off-road steps, racetrack 2.

8 Conclusion and Future Work

The main question of apprenticeship learning is how to generalize the expert's policy to states that have not been encountered during the demonstration. Inverse Reinforcement Learning (IRL) provides an efficient answer, which consists in first learning a reward function that explains the observed behavior, and then using it for the generalization. A strong assumption made in IRL-based algorithms is that the reward is a linear function of state-action features, and that the frequencies of these features can be estimated from a few demonstrations even if these demonstrations cover only a small part of the state space. In this paper, we showed that this assumption does not hold in highly stochastic systems. 
We also showed that this problem can be solved by modifying the cost function so that the value of the learned policy is compared to the exact value of a generalized expert's policy. We further provided theoretical insights on the modified cost function, showing that it admits the expert's true reward as a locally optimal solution under mild conditions. The empirical analysis confirmed the superior performance of the bootstrapped MMP in particular. These promising results encourage us to further investigate the theoretical properties of the modified cost function.
As future work, we mainly aim to compare this approach with the one proposed by Ratliff et al. (2007), where the base features are boosted by using a classifier.

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship Learning via Inverse Reinforcement Learning.
In Proceedings of the Twenty-first International Conference on Machine Learning (ICML'04), pp. 1–8, 2004.

Boularias, Abdeslam and Chaib-draa, Brahim. Apprenticeship Learning via Soft Local Homomorphisms. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation (ICRA'10), pp. 2971–2976, 2010.

Neu, Gergely and Szepesvári, Csaba. Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods. In Conference on Uncertainty in Artificial Intelligence (UAI'07), pp. 295–302, 2007.

Ng, Andrew and Russell, Stuart. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML'00), pp. 663–670, 2000.

Ramachandran, Deepak and Amir, Eyal. Bayesian Inverse Reinforcement Learning. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI'07), pp. 2586–2591, 2007.

Ratliff, Nathan, Bagnell, J. Andrew, and Zinkevich, Martin. Maximum Margin Planning. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML'06), pp. 729–736, 2006.

Ratliff, Nathan, Bradley, David, Bagnell, J. Andrew, and Chestnutt, Joel. Boosting Structured Prediction for Imitation Learning. In Advances in Neural Information Processing Systems 19 (NIPS'07), pp. 1153–1160, 2007.

Syed, Umar and Schapire, Robert. A Game-Theoretic Approach to Apprenticeship Learning. In Advances in Neural Information Processing Systems 20 (NIPS'08), pp. 1449–1456, 2008.

Syed, Umar, Bowling, Michael, and Schapire, Robert E. Apprenticeship Learning using Linear Programming. In Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pp.
1032–1039, 2008.