{"title": "Repeated Inverse Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1815, "page_last": 1824, "abstract": "We introduce a novel repeated Inverse Reinforcement Learning problem: the agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks that it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.", "full_text": "Repeated Inverse Reinforcement Learning\n\nKareem Amin\u2217\nGoogle Research\n\nNew York, NY 10011\nkamin@google.com\n\nNan Jiang\u2217\nSatinder Singh\nComputer Science & Engineering,\n\nUniversity of Michigan, Ann Arbor, MI 48104\n\n{nanjiang,baveja}@umich.edu\n\nAbstract\n\nWe introduce a novel repeated Inverse Reinforcement Learning problem: the agent\nhas to act on behalf of a human in a sequence of tasks and wishes to minimize the\nnumber of tasks that it surprises the human by acting suboptimally with respect to\nhow the human would have acted. Each time the human is surprised, the agent is\nprovided a demonstration of the desired behavior by the human. We formalize this\nproblem, including how the sequence of tasks is chosen, in a few different ways\nand provide some foundational results.\n\n1\n\nIntroduction\n\nOne challenge in building AI agents that learn from experience is how to set their goals or rewards.\nIn the Reinforcement Learning (RL) setting, one interesting answer to this question is inverse RL\n(or IRL) in which the agent infers the rewards of a human by observing the human\u2019s policy in a task\n[2]. 
Unfortunately, the IRL problem is ill-posed, for there are typically many reward functions for which the observed behavior is optimal in a single task [3]. While the use of heuristics to select from among the set of feasible reward functions has led to successful applications of IRL to the problem of learning from demonstration [e.g., 4], not identifying the reward function poses fundamental challenges to the question of how well and how safely the agent will perform when using the learned reward function in other tasks.\nWe formalize multiple variations of a new repeated IRL problem in which the agent and (the same) human face multiple tasks over time. We separate the reward function into two components, one which is invariant across tasks and can be viewed as intrinsic to the human, and a second that is task-specific. As a motivating example, consider a human doing tasks throughout a work day, e.g., getting coffee, driving to work, interacting with co-workers, and so on. Each of these tasks has a task-specific goal, but the human brings to each task intrinsic goals that correspond to maintaining health, financial well-being, not violating moral and legal principles, etc. In our repeated IRL setting, the agent presents a policy for each new task that it thinks the human would follow. If the agent's policy “surprises” the human by being sub-optimal, the human presents the agent with the optimal policy. The objective of the agent is to minimize the number of surprises to the human, i.e., to generalize the human's behavior to new tasks.\nIn addition to addressing generalization across tasks, the repeated IRL problem we introduce and our results are of interest in resolving the question of unidentifiability of rewards from observations in standard IRL. Our results are also of interest to a particular aspect of the concern about how to make sure that the AI systems we build are safe, or AI safety. 
Specifically, the issue of reward misspecification is often mentioned in AI safety articles [e.g., 5, 6, 7]. These articles mostly discuss broad ethical concerns and possible research directions, while our paper develops mathematical formulations and algorithmic solutions to a specific way of addressing reward misspecification.\n\n*This paper extends an unpublished arXiv paper by the authors [1]. ∗Equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn summary form, our contributions include: (1) an efficient reward-identification algorithm when the agent can choose the tasks in which it observes human behavior; (2) an upper bound on the number of total surprises when no assumptions are made on the tasks, along with a corresponding lower bound; (3) an extension to the setting where the human provides sample trajectories instead of complete behavior; and (4) identification guarantees when the agent can only choose the task rewards but is given a fixed task environment.\n\n2 Markov Decision Processes (MDPs)\n\nAn MDP is specified by its state space S, action space A, initial state distribution μ ∈ Δ(S), transition function (or dynamics) P : S × A → Δ(S), reward function Y : S → ℝ, and discount factor γ ∈ [0, 1). We assume finite S and A, and Δ(S) is the space of all distributions over S. A policy π : S → A describes an agent's behavior by specifying the action to take in each state. The (normalized) value function or long-term utility of π is defined as V^π(s) = (1 − γ) E[Σ_{t=1}^∞ γ^{t−1} Y(s_t) | s_0 = s; π].² Similarly, the Q-value function is Q^π(s, a) = (1 − γ) E[Σ_{t=1}^∞ γ^{t−1} Y(s_t) | s_0 = s, a_0 = a; π]. 
Where necessary we will use the notation V^π_{P,Y} to avoid ambiguity about the dynamics and the reward function. Let π* : S → A be an optimal policy, which maximizes V^π and Q^π in all states (and actions) simultaneously.\nGiven an initial distribution over states, μ, a scalar value that measures the goodness of π is defined as E_{s∼μ}[V^π(s)]. We introduce some further notation to express E_{s∼μ}[V^π(s)] in vector-matrix form. Let η^π_{μ,P} ∈ ℝ^{|S|} be the normalized state occupancy under initial distribution μ, dynamics P, and policy π, whose s-th entry is (1 − γ) E[Σ_{t=1}^∞ γ^{t−1} I(s_t = s) | s_0 ∼ μ; π] (I(·) is the indicator function). This vector can be computed in closed form as η^π_{μ,P} = (1 − γ) (μ^⊤ P^π (I_{|S|} − γP^π)^{−1})^⊤, where P^π is an |S| × |S| matrix whose (s, s′)-th element is P(s′|s, π(s)), and I_{|S|} is the |S| × |S| identity matrix. For convenience we will also treat the reward function Y as a vector in ℝ^{|S|}, and we have\n\nE_{s∼μ}[V^π(s)] = Y^⊤ η^π_{μ,P}.    (1)\n\n3 Problem setup\n\nHere we define the repeated IRL problem. The human's reward function θ* captures his/her safety concerns and intrinsic/general preferences. This θ* is unknown to the agent and is the object of interest herein, i.e., if θ* were known to the agent, the concerns addressed in this paper would be solved. We assume that the human cannot directly communicate θ* to the agent but can evaluate the agent's behavior in a task as well as demonstrate optimal behavior. 
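As a concreteness check on Equation 1 from Section 2, here is a minimal numerical sketch; the chain, rewards, and discount are made-up toy values, not from the paper:

```python
import numpy as np

# Toy 3-state chain (hypothetical numbers, just to illustrate Equation 1).
gamma = 0.9
P_pi = np.array([[0.1, 0.9, 0.0],   # P^pi: |S| x |S| transition matrix under policy pi
                 [0.0, 0.2, 0.8],
                 [0.5, 0.0, 0.5]])
mu = np.array([1.0, 0.0, 0.0])      # initial state distribution
Y = np.array([0.0, 0.5, 1.0])       # reward vector (reward received after transition)

I = np.eye(3)
# Closed-form normalized occupancy: eta = (1 - gamma) * (mu^T P^pi (I - gamma P^pi)^{-1})^T
eta = (1 - gamma) * (mu @ P_pi @ np.linalg.inv(I - gamma * P_pi))

# Sanity check against the Bellman solve: V = (1 - gamma) (I - gamma P^pi)^{-1} P^pi Y,
# so E_{s~mu}[V(s)] should equal Y^T eta.
V = (1 - gamma) * np.linalg.inv(I - gamma * P_pi) @ (P_pi @ Y)
print(np.allclose(Y @ eta, mu @ V))  # True
print(np.isclose(eta.sum(), 1.0))    # normalized occupancies sum to 1
```

Both sides reduce to (1 − γ) μ^⊤ P^π (I − γP^π)^{−1} Y, since P^π commutes with (I − γP^π)^{−1}.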
Each task comes with an external reward function R, and the goal is to maximize the reward with respect to Y := θ* + R in each task. As a concrete example, consider an agent for an autonomous vehicle. In this case, θ* represents the cross-task principles that define good driving (e.g., courtesy towards pedestrians and other vehicles), which are often difficult to explicitly describe. In contrast, R, the task-specific reward, could reward the agent for successfully completing parallel parking. While R is easier to construct, it may not completely capture what a human deems good driving. (For example, an agent might successfully parallel park while still boxing in neighboring vehicles.)\nMore formally, a task is defined by a pair (E, R), where E = (S, A, μ, P, γ) is the task environment (i.e., a controlled Markov process) and R is the task-specific reward function (task reward). We assume that all tasks share the same S, A, γ, with |A| ≥ 2, but may differ in the initial distribution μ, dynamics P, and task reward R; all of the task-specifying quantities are known to the agent. In any task, the human's optimal behavior is always with respect to the reward function Y = θ* + R. We emphasize again that θ* is intrinsic to the human and remains the same across all tasks. Our use of task-specific reward functions R allows for greater generality than the usual IRL setting, and most of our results apply equally to the case where R ≡ 0.\nWhile θ* is private to the human, the agent has some prior knowledge about θ*, represented as a set of possible parameters Θ₀ ⊂ ℝ^{|S|} that contains θ*. Throughout, we assume that the human's reward has bounded and normalized magnitude, that is, ‖θ*‖∞ ≤ 1.\n\n²Here we differ (w.l.o.g.) 
from common IRL literature in assuming that reward occurs after transition.\n\nA demonstration in (E, R) reveals π*, optimal for Y = θ* + R under environment E, to the agent. A common assumption in the IRL literature is that the full mapping is revealed, which can be unrealistic if some states are unreachable from the initial distribution. We address the issue by requiring only the state occupancy vector η^{π*}_{μ,P}. In Section 7 we show that this also allows an easy extension to the setting where the human only demonstrates trajectories instead of providing a policy.\nUnder the above framework for repeated IRL, we consider two settings that differ in how the sequence of tasks is chosen. In both settings, we will want to minimize the number of demonstrations needed.\n1. (Section 5) Agent chooses the tasks, observes the human's behavior in each of them, and infers the reward function. In this setting where the agent is powerful enough to choose tasks arbitrarily, we will show that the agent will be able to identify the human's reward function, which of course implies the ability to generalize to new tasks.\n2. (Section 6) Nature chooses the tasks, and the agent proposes a policy in each task. The human demonstrates a policy only if the agent's policy is significantly suboptimal (i.e., a mistake). In this setting we will derive upper and lower bounds on the number of mistakes our agent will make.\n\n4 The challenge of identifying rewards\n\nNote that it is impossible to identify θ* from watching human behavior in a single task. This is because any θ* is fundamentally indistinguishable from an infinite set of reward functions that yield exactly the policy observed in the task. We introduce the idea of behavioral equivalence below to tease apart two separate issues wrapped up in the challenge of identifying rewards.\nDefinition 1. 
Two reward functions θ, θ′ ∈ ℝ^{|S|} are behaviorally equivalent in all MDP tasks if, for any (E, R), the sets of optimal policies for (R + θ) and (R + θ′) are the same.\nWe argue that the task of identifying the reward function should amount only to identifying the (behavioral) equivalence class to which θ* belongs. In particular, identifying the equivalence class is sufficient to get perfect generalization to new tasks. Any remaining unidentifiability is merely representational and of no real consequence. Next we present a constraint that captures the reward functions that belong to the same equivalence class.\nProposition 1. Two reward functions θ and θ′ are behaviorally equivalent in all MDP tasks if and only if θ − θ′ = c · 1_{|S|} for some c ∈ ℝ, where 1_{|S|} is an all-1 vector of length |S|.\nThe proof is elementary and deferred to Appendix A. For any class of θ's that are equivalent to each other, we can choose a canonical element to represent this class. For example, we can fix an arbitrary reference state s_ref ∈ S, and fix the reward of this state to 0 for θ* and all candidate θ's. In the rest of the paper, we will always assume such canonicalization in the MDP setting, hence θ* ∈ Θ₀ ⊆ {θ ∈ [−1, 1]^{|S|} : θ(s_ref) = 0}.\n\n5 Agent chooses the tasks\n\nIn this section, the protocol is that the agent chooses a sequence of tasks {(E_t, R_t)}. For each task (E_t, R_t), the human reveals π*_t, which is optimal for environment E_t and reward function θ* + R_t. Our goal is to design an algorithm which chooses {(E_t, R_t)} and identifies θ* to a desired accuracy ε, using as few tasks as possible. Theorem 1 shows that a simple algorithm can identify θ* after only O(log(1/ε)) tasks, if any tasks may be chosen. 
Roughly speaking, the algorithm amounts to a binary search on each component of θ* by manipulating the task reward R_t.³ See the proof for the algorithm specification. As noted before, once the agent has identified θ* within an appropriate tolerance, it can compute a sufficiently-near-optimal policy for all tasks, thus completing the generalization objective through the far stronger identification objective in this setting.\nTheorem 1. If θ* ∈ Θ₀ ⊆ {θ ∈ [−1, 1]^{|S|} : θ(s_ref) = 0}, there exists an algorithm that outputs θ ∈ ℝ^{|S|} satisfying ‖θ − θ*‖∞ ≤ ε after O(log(1/ε)) demonstrations.\nProof. The algorithm chooses the following fixed environment in all tasks: for each s ∈ S \\ {s_ref}, let one action be a self-loop, and let the other action transition to s_ref. In s_ref, all actions cause self-loops.\n\n³While we present a proof that manipulates R_t, an only slightly more complex proof applies to the setting where all the R_t are exactly zero and the manipulation is limited to the environment [1].\n\nThe initial distribution over states is uniform over S \\ {s_ref}. Each task only differs in the task reward R_t (where R_t(s_ref) ≡ 0 always). After observing the state occupancy of the optimal policy, for each s we check whether the occupancy is equal to 0. If so, it means that the demonstrated optimal policy chooses to go to s_ref from s in the first time step, and θ*(s) + R_t(s) ≤ θ*(s_ref) + R_t(s_ref) = 0; if not, we have θ*(s) + R_t(s) ≥ 0. 
Consequently, after each task we learn the relationship between θ*(s) and −R_t(s) at each s ∈ S \\ {s_ref}, so conducting a binary search by manipulating R_t(s) identifies θ* to ε-accuracy after O(log(1/ε)) tasks.\n\n6 Nature chooses the tasks\n\nWhile Theorem 1 yields a strong identification guarantee, it also relies on a strong assumption: that {(E_t, R_t)} may be chosen by the agent in an arbitrary manner. In this section, we let nature, which is allowed to be adversarial for the purpose of the analysis, choose {(E_t, R_t)}.\nGenerally speaking, we cannot obtain identification guarantees in such an adversarial setup. As an example, if R_t ≡ 0 and E_t remains the same over time, we are essentially back to the classical IRL setting and suffer from the degeneracy issue. However, generalization to future tasks, which is our ultimate goal, is easy in this special case: after the initial demonstration, the agent can mimic it to behave optimally in all subsequent tasks without requiring further demonstrations. More generally, if nature repeats similar tasks, then the agent obtains little new information, but presumably it knows how to behave in most cases; if nature chooses a task unfamiliar to the agent, then the agent is likely to err, but it may learn about θ* from the mistake.\nTo formalize this intuition, we consider the following protocol: nature chooses a sequence of tasks {(E_t, R_t)} in an arbitrary manner. For every task (E_t, R_t), the agent proposes a policy π_t. 
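The binary-search argument in the proof of Theorem 1 can be simulated directly. The sketch below replaces the demonstrated occupancies with an equivalent sign oracle (which state occupancies are nonzero); function names and constants are illustrative, not from the paper:

```python
import numpy as np

def identify_theta(theta_star, eps):
    """Binary-search each component of theta_star in [-1, 1] to eps-accuracy.

    The 'demonstration' oracle tells us, per state s, whether the occupancy of s
    is nonzero, i.e., whether theta_star[s] + R_t[s] >= 0 (states where the human
    prefers the self-loop over jumping to s_ref).
    """
    lo = -np.ones_like(theta_star)  # invariant: theta_star in [lo, hi] componentwise
    hi = np.ones_like(theta_star)
    n_tasks = int(np.ceil(np.log2(2.0 / eps)))  # each task halves every interval
    for _ in range(n_tasks):
        R_t = -(lo + hi) / 2.0                  # probe at the interval midpoints
        stays = theta_star + R_t >= 0           # oracle: occupancy of s is nonzero
        lo = np.where(stays, -R_t, lo)
        hi = np.where(stays, hi, -R_t)
    return (lo + hi) / 2.0, n_tasks

rng = np.random.default_rng(0)
theta_star = rng.uniform(-1, 1, size=8)
theta_hat, n_tasks = identify_theta(theta_star, eps=1e-3)
print(np.max(np.abs(theta_hat - theta_star)) <= 1e-3)  # True, after O(log(1/eps)) tasks
```

Note that all |S| − 1 components are searched in parallel, one comparison per state per task, which is why the task count has no dependence on |S|.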
The human examines the policy's value under μ_t, and if the loss\n\nl_t = E_{s∼μ_t}[V^{π*_t}_{E_t, θ*+R_t}(s)] − E_{s∼μ_t}[V^{π_t}_{E_t, θ*+R_t}(s)]    (2)\n\nis less than some ε, then the human is satisfied and no demonstration is needed; otherwise a mistake is counted and η^{π*_t}_{μ_t,P_t} is revealed to the agent (note that η^{π*_t}_{μ_t,P_t} can be computed by the agent if needed from π*_t and its knowledge of the task). The main goal of this section is to design an algorithm that has a provable guarantee on the total number of mistakes.\nOn human supervision. Here we require the human to evaluate the agent's policies in addition to providing demonstrations. We argue that this is a reasonable assumption because (1) only a binary signal I(l_t > ε) is needed as opposed to the precise value of l_t, and (2) if a policy is suboptimal but the human fails to realize it, arguably it should not be treated as a mistake. Meanwhile, we will also provide identification guarantees in Section 6.4, as the human will be relieved from the supervision duty once θ* is identified.\nBefore describing and analyzing our algorithm, we first notice that Equation 2 can be rewritten as\n\nl_t = (θ* + R_t)^⊤ (η^{π*_t}_{μ_t,P_t} − η^{π_t}_{μ_t,P_t}),    (3)\n\nusing Equation 1. So effectively, the given environment E_t in each round induces a set of state occupancy vectors {η^π_{μ_t,P_t} : π ∈ (S → A)}, and we want the agent to choose the vector that has the largest dot product with θ* + R_t. The exponential size of the set will not be a concern because our main result (Theorem 2) has no dependence on the number of vectors, and only depends on the dimension of those vectors. 
The result is enabled by studying the linear bandit version of the problem, which subsumes the MDP setting for our purpose and is also a model of independent interest.\n\n6.1 The linear bandit setting\n\nIn the linear bandit setting, D is a finite action space with size |D| = K. Each task is denoted as a pair (X, R), where R is the task-specific reward function as before. X = [x^{(1)} ··· x^{(K)}] is a d × K feature matrix, where x^{(i)} is the feature vector for the i-th action, and ‖x^{(i)}‖₁ ≤ 1. When we reduce MDPs to linear bandits, each element of D corresponds to an MDP policy, and the feature vector is the state occupancy of that policy.\nAs before, R, θ* ∈ ℝ^d are the task reward and the human's unknown reward, respectively. The initial uncertainty set for θ* is Θ₀ ⊆ [−1, 1]^d. The value of the i-th action is calculated as (θ* + R)^⊤ x^{(i)},\n\nAlgorithm 1 Ellipsoid Algorithm for Repeated Inverse Reinforcement Learning\n1: Input: Θ₀.\n2: Θ₁ ← MVEE(Θ₀).\n3: for t = 1, 2, . . . do\n4:   Nature reveals (X_t, R_t).\n5:   Learner plays a_t = argmax_{a∈D} (c_t + R_t)^⊤ x_t^a, where c_t is the center of Θ_t. Θ_{t+1} ← Θ_t.\n6:   if l_t > ε then\n7:     Human reveals a*_t. Θ_{t+1} ← MVEE({θ ∈ Θ_t : (θ − c_t)^⊤(x_t^{a*_t} − x_t^{a_t}) ≥ 0}).\n8:   end if\n9: end for\n\nand a* is the action that maximizes this value. Every round the agent proposes an action a ∈ D, whose loss is defined as\n\nl_t = (θ* + R)^⊤ (x^{a*} − x^a).\n\nAs before, a mistake is counted when l_t > ε, in which case the optimal demonstration x^{a*} is provided to the agent. We reiterate here that the agent only receives a binary signal I(l_t > ε) in addition to the demonstration. 
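A runnable sketch of Algorithm 1's loop, using the standard closed-form central-cut MVEE update in place of a general MVEE solver; the random task distribution and the ellipsoid parametrization are illustrative assumptions, not the paper's:

```python
import numpy as np

def mvee_halfspace_cut(c, A, g):
    """Min-volume ellipsoid enclosing {x in E(c, A) : g^T (x - c) >= 0}.

    E(c, A) = {x : (x - c)^T A^{-1} (x - c) <= 1}; standard central-cut
    ellipsoid-method update, valid for d >= 2.
    """
    d = len(c)
    b = A @ g / np.sqrt(g @ A @ g)
    c_new = c + b / (d + 1)
    A_new = (d**2 / (d**2 - 1.0)) * (A - (2.0 / (d + 1)) * np.outer(b, b))
    return c_new, A_new

rng = np.random.default_rng(1)
d, K, eps = 3, 5, 0.1
theta_star = rng.uniform(-1, 1, size=d)
c = np.zeros(d)      # Theta_1 = MVEE([-1, 1]^d): the ball of radius sqrt(d)
A = d * np.eye(d)
mistakes = 0
for t in range(500):
    X = rng.uniform(0, 1, size=(d, K))
    X /= X.sum(axis=0)              # columns are occupancy-like: ||x||_1 <= 1
    R = rng.uniform(-1, 1, size=d)
    a_t = np.argmax((c + R) @ X)    # greedy w.r.t. the ellipsoid center
    a_star = np.argmax((theta_star + R) @ X)
    if (theta_star + R) @ (X[:, a_star] - X[:, a_t]) > eps:   # a mistake
        mistakes += 1
        c, A = mvee_halfspace_cut(c, A, X[:, a_star] - X[:, a_t])
        # theta_star satisfies the cut, so it is never eliminated:
        assert (theta_star - c) @ np.linalg.solve(A, theta_star - c) <= 1 + 1e-7

print(mistakes <= 2 * d * (d + 1) * np.log(4 * np.sqrt(d) / eps))  # Theorem 2 bound
```

Each mistake triggers one central cut, so the mistake count is controlled by how often the ellipsoid volume can shrink before it would violate the lower bound around θ*, exactly as in the proof of Theorem 2.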
We use the term “linear bandit” to refer to the generative process, but our interaction protocol differs from those in the standard bandit literature, where reward or cost is revealed [8, 9]. We now show how to embed the previous MDP setting in the linear bandit setting.\nExample 1. Given an MDP problem with variables S, A, γ, θ*, s_ref, Θ₀, {(E_t, R_t)}, we can convert it into a linear bandit problem as follows (all variables with prime belong to the linear bandit problem, and we use v\\i to denote the vector v with the i-th coordinate removed):\n• D = {π : S → A}, d = |S| − 1, θ*′ = θ*\\s_ref, Θ′₀ = {θ\\s_ref : θ ∈ Θ₀}.\n• x_t^π = (η^π_{μ_t,P_t})\\s_ref, R′_t = R_t\\s_ref − R_t(s_ref) · 1_d.\nNote that there is a more straightforward conversion by letting d = |S|, θ*′ = θ*, Θ′₀ = Θ₀, x_t^π = η^π_{μ_t,P_t}, R′_t = R_t, which also preserves losses. We perform a more succinct conversion in Example 1 by canonicalizing both θ* (already assumed) and R_t (explicitly done here) and dropping the coordinate for s_ref in all relevant vectors.\nMDPs with linear rewards. In the IRL literature, a generalization of the MDP setting is often considered, in which reward is linear in state features φ(s) ∈ ℝ^d [2, 3]. In this new setting, θ* and R are reward parameters, and the actual reward is (θ* + R)^⊤ φ(s). This new setting can also be reduced to the linear bandit setting similarly to Example 1, except that the state occupancy is replaced by the discounted sum of expected feature values. Our main result, Theorem 2, will still apply automatically, but now the guarantee will only depend on the dimension of the feature space and has no dependence on |S|. 
We include the conversion below but do not further discuss this setting in the rest of the paper.\nExample 2. Consider an MDP problem with state features, defined by S, A, γ, d ∈ ℤ₊, θ* ∈ ℝ^d, Θ₀ ⊆ [−1, 1]^d, {(E_t, φ_t : S → ℝ^d, R_t ∈ ℝ^d)}, where the background reward and task reward in state s are θ*^⊤ φ_t(s) and R_t^⊤ φ_t(s) respectively, and θ* ∈ Θ₀. Suppose ‖φ_t(s)‖∞ ≤ 1 always holds; then we can convert it into a linear bandit problem as follows: D = {π : S → A}; d, θ*, and R_t remain the same; x_t^π = (1 − γ) Σ_{h=1}^∞ γ^{h−1} E[φ_t(s_h) | μ_t, P_t, π]/d. Note that the division by d in x_t^π is for the purpose of normalization, so that ‖x_t^π‖₁ ≤ ‖φ_t‖₁/d ≤ ‖φ_t‖∞ ≤ 1.\n\n6.2 Ellipsoid Algorithm for Repeated Inverse Reinforcement Learning\n\nWe propose Algorithm 1, and provide the mistake bound in the following theorem.\nTheorem 2. For Θ₀ = [−1, 1]^d, the number of mistakes made by Algorithm 1 is guaranteed to be O(d² log(d/ε)).\nTo prove Theorem 2, we quote a result from the linear programming literature in Lemma 1, which is found in standard lecture notes (e.g., [10], Theorem 8.8; see also [11], Lemma 3.1.34).\nLemma 1 (Volume reduction in ellipsoid algorithm). Given any non-degenerate ellipsoid B in ℝ^d centered at c ∈ ℝ^d, and any non-zero vector v ∈ ℝ^d, let B₊ be the minimum-volume enclosing ellipsoid (MVEE) of {u ∈ B : (u − c)^⊤ v ≥ 0}. We have vol(B₊)/vol(B) ≤ e^{−1/(2(d+1))}.\nProof of Theorem 2. Whenever a mistake is made, we can induce the constraint (R_t + θ*)^⊤(x_t^{a*_t} − x_t^{a_t}) > ε. Meanwhile, since a_t is greedy w.r.t. 
c_t, we have (R_t + c_t)^⊤(x_t^{a*_t} − x_t^{a_t}) ≤ 0, where c_t is the center of Θ_t as in Line 5. Taking the difference of the two inequalities, we obtain\n\n(θ* − c_t)^⊤(x_t^{a*_t} − x_t^{a_t}) > ε.    (4)\n\nTherefore, the update rule on Line 7 of Algorithm 1 preserves θ* in Θ_{t+1}. Since the update makes a central cut through the ellipsoid, Lemma 1 applies and the volume shrinks every time a mistake is made. To prove the theorem, it remains to upper bound the initial volume and lower bound the terminal volume of Θ_t. We first show that an update never eliminates B∞(θ*, ε/2), the ℓ∞ ball centered at θ* with radius ε/2. This is because any eliminated θ satisfies (θ − c_t)^⊤(x_t^{a*_t} − x_t^{a_t}) < 0. Combining this with Equation 4, we have\n\nε < (θ* − θ)^⊤(x_t^{a*_t} − x_t^{a_t}) ≤ ‖θ* − θ‖∞ ‖x_t^{a*_t} − x_t^{a_t}‖₁ ≤ 2‖θ* − θ‖∞.\n\nThe last step follows from ‖x‖₁ ≤ 1. We conclude that any eliminated θ must be ε/2 far away from θ* in ℓ∞ distance. Hence, we can lower bound the volume of Θ_t for any t by that of Θ₀ ∩ B∞(θ*, ε/2), which contains an ℓ∞ ball with radius ε/4 at its smallest (when θ* is one of
To simplify calculation, we relax this lower bound (volume of the (cid:96)\u221e ball) to the\nvolume of the inscribed (cid:96)2 ball.\nFinally we put everything together: let MT be the number of mistakes made from round 1 to T , Cd\nbe the volume of the unit hypersphere in Rd (i.e., (cid:96)2 ball with radius 1), and vol(\u00b7) denote the volume\nof an ellipsoid, we have\n\nt \u2212 xat\n\nt \u2212 xat\n\nt\n\nt\n\nt\n\nMT\n\n\u221a\n\u2264 log(vol(\u03981)) \u2212 log(vol(\u0398T +1)) \u2264 log(Cd(\n\nd)d) \u2212 log(Cd(\u0001/4)d) = d log\n\n4\n\nd\n\n.\n\n\u221a\n\n\u0001\n\n2(d + 1)\nSo MT \u2264 2d(d + 1) log 4\n\n\u221a\nd\n\u0001 = O(d2 log d\n\n\u0001 ).\n\n6.3 Lower bound\n\nIn Section 5, we get an O(log(1/\u0001)) upper bound on the number of demonstrations, which has no\ndependence on |S| (which corresponds to d + 1 in the linear bandit setting). Comparing Theorem 2\nto 1, one may wonder whether the polynomial dependence on d is an artifact of the inef\ufb01ciency of\nAlgorithm 1. We clarify this issue by proving a lower bound, showing that \u2126(d log(1/\u0001)) mistakes\nare inevitable in the worst case when nature chooses the tasks. We provide a proof sketch below, and\nthe complete proof is deferred to Appendix E.\nTheorem 3. For any randomized algorithm4 in the linear bandit setting, there always exists \u03b8(cid:63) \u2208\n[\u22121, 1]d and an adversarial sequence of {(Xt, Rt)} that potentially adapts to the algorithm\u2019s previous\ndecisions, such that the expected number of mistakes made by the algorithm is \u2126(d log(1/\u0001)).\nProof Sketch. We randomize \u03b8(cid:63) by sampling each element i.i.d. from Unif([\u22121, 1]). We will prove\nthat there exists a strategy of choosing (Xt, Rt) such that any algorithm\u2019s expected number of\nmistakes is \u2126(d log(1/\u0001), which proves the theorem as max is no less than average.\nIn our construction, Xt = [0d, ejt], where jt is some index to be speci\ufb01ed. 
Hence, every round the agent is essentially asked to decide whether θ*(j_t) ≥ −R_t(j_t). The adversary's strategy proceeds in phases, and R_t remains the same during each phase. Every phase has d rounds, where j_t is enumerated over {1, . . . , d}.\nThe adversary will use R_t to shift the posterior on θ*(j_t) + R_t(j_t) so that it is centered around the origin; in this way, the agent has about 1/2 probability of making an error (regardless of the algorithm), and the posterior interval will be halved. Overall, the agent makes d/2 mistakes in each phase, and there will be about log(1/ε) phases in total, which gives the lower bound.\n\n⁴While our Algorithm 1 is deterministic, randomization is often crucial for online learning in general [12].\n\nApplying the lower bound to MDPs. The above lower bound is stated for the linear bandit setting. In principle, we need to prove a lower bound for MDPs separately, because linear bandits are more general than MDPs for our purpose, and the hard instances in linear bandits may not have corresponding MDP instances. In Lemma 2 below, we show that a certain type of linear bandit instance can always be emulated by an MDP with the same number of actions, and the hard instances constructed in Theorem 3 indeed satisfy the conditions for such a type; in particular, we require the feature vectors to be non-negative and have ℓ₁ norm bounded by 1. As a corollary, an Ω(|S| log(1/ε)) lower bound for the MDP setting (even with a small action space, |A| = 2) follows directly from Theorem 3. The proof of Lemma 2 is deferred to Appendix B.\nLemma 2 (Linear bandit to MDP conversion). Let (X, R) be a linear bandit task, and K be the number of actions. 
If every x^a is non-negative and ‖x^a‖₁ ≤ 1, then there exists an MDP task (E, R′) with d + 1 states and K actions, such that under some choice of s_ref, converting (E, R′) as in Example 1 recovers the original problem.\n\n6.4 On identification when nature chooses tasks\n\nWhile Theorem 2 successfully controls the number of total mistakes, it completely avoids the identification problem and does not guarantee to recover θ*. In this section we explore further conditions under which we can obtain identification guarantees when nature chooses the tasks.\nThe first condition, stated in Proposition 2, implies that if we have made all the possible mistakes, then we have indeed identified θ*, where the identification accuracy is determined by the tolerance parameter ε that defines what is counted as a mistake. Due to space limits, the proof is deferred to Appendix C.\nProposition 2. Consider the linear bandit setting. If there exists T₀ such that for any round t ≥ T₀, no more mistakes can ever be made by the algorithm for any choice of (E_t, R_t) and any tie-breaking mechanism, then we have θ* ∈ B∞(c_{T₀}, ε).\nWhile the above proposition shows that identification is guaranteed if the agent exhausts the mistakes, the agent has no ability to actively fulfill this condition when nature chooses tasks. For a stronger identification guarantee, we may need to grant the agent some freedom in choosing the tasks.\nIdentification with fixed environment. Here we consider a setting that fits in between Section 5 (completely active) and Section 6.1 (completely passive), where the environment E (hence the induced feature vectors {x^{(1)}, x^{(2)}, . . . , x^{(K)}}) is given and fixed, and the agent can arbitrarily choose the task reward R_t. 
The goal is to obtain an identification guarantee in this intermediate setting.\nUnfortunately, a degenerate case can easily be constructed that prevents the revelation of any information about θ*. In particular, if x^{(1)} = x^{(2)} = . . . = x^{(K)}, i.e., the environment is completely uncontrolled, then all actions are equally optimal and nothing can be learned. More generally, if for some v ≠ 0 we have v^⊤x^{(1)} = v^⊤x^{(2)} = . . . = v^⊤x^{(K)}, then we may never recover θ* along the direction of v. In fact, Proposition 1 can be viewed as an instance of this result where v = 1_{|S|} (recall that 1_{|S|}^⊤ η^π_{μ,P} ≡ 1), and that is why we have to remove such redundancy in Example 1 in order to discuss identification in MDPs. Therefore, to guarantee identification in a fixed environment, the feature vectors must have significant variation in all directions. We capture this intuition by defining a diversity score spread(X) (Definition 2) and showing that the identification accuracy depends inversely on the score (Theorem 4).\nDefinition 2. Given the feature matrix X = [x^{(1)} x^{(2)} ··· x^{(K)}] whose size is d × K, define spread(X) as the d-th largest singular value of X̃ := X(I_K − (1/K) 1_K 1_K^⊤).\nTheorem 4. For a fixed feature matrix X, if spread(X) > 0, then there exist a sequence R₁, R₂, . . . , R_T with T = O(d² log(d/ε)) and a sequence of tie-break choices of the algorithm, such that after round T we have ‖c_T − θ*‖∞ ≤ ε√((K − 1)/2)/spread(X).\nThe proof is deferred to Appendix D. The √K dependence in Theorem 4 may be of concern, as K can be exponentially large. 
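Definition 2's diversity score can be computed directly from the singular values of the column-centered feature matrix; a small sketch with made-up feature matrices:

```python
import numpy as np

def spread(X):
    """d-th largest singular value of X (I_K - (1/K) 1 1^T), per Definition 2."""
    d, K = X.shape
    X_tilde = X @ (np.eye(K) - np.ones((K, K)) / K)  # center the columns
    return np.linalg.svd(X_tilde, compute_uv=False)[d - 1]

# Degenerate case: identical columns carry no information, so spread is 0.
X_flat = np.tile(np.array([[0.3], [0.7]]), (1, 4))   # 2 x 4, all columns equal
# Diverse case: columns vary in all directions of R^2 after centering.
X_div = np.array([[0.9, 0.1, 0.2],
                  [0.1, 0.9, 0.2]])
print(np.isclose(spread(X_flat), 0.0))  # True: no identification possible
print(spread(X_div) > 0)                # True: Theorem 4 applies
```

Note that occupancy-like columns all satisfying a shared linear constraint (e.g., coordinates summing to 1) are collinear after centering and also drive the d-th singular value to zero, which is the v ≠ 0 degeneracy discussed above.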
However, Theorem 4 also holds if we replace X by any matrix consisting of a subset of X's columns, so we may choose a small yet maximally diverse set of columns to optimize the bound.

7 Working with trajectories

In previous sections, we have assumed that the human evaluates the agent's performance based on the state occupancy of the agent's policy, and demonstrates the optimal policy in terms of state occupancy as well. In practice, we would instead like to assume that for each task the agent rolls out a trajectory, and the human shows an optimal trajectory if he/she finds the agent's trajectory unsatisfying. We are still concerned with upper bounding the total number of mistakes, and aim to provide a parallel version of Theorem 2.

Algorithm 2 Trajectory version of Algorithm 1 for MDPs
1: Input: Θ_0, H, n.
2: Θ_1 ← MVEE(Θ_0), i ← 0, Z̄ ← 0, Z̄* ← 0.
3: for t = 1, 2, ... do
4:   Nature reveals (E_t, R_t). Agent rolls out a trajectory using π_t, greedy w.r.t. c_t + R_t.
5:   Θ_{t+1} ← Θ_t.
6:   if agent takes a in s with Q*(s, a) < V*(s) − ε then
7:     Human produces an H-step trajectory from s; let ẑ*_i,H be its empirical state occupancy.
8:     i ← i + 1, Z̄* ← Z̄* + ẑ*_i,H.
9:     Let z_i be the state occupancy of π_t from initial state s, and Z̄ ← Z̄ + z_i.
10:    if i = n then
11:      Θ_{t+1} ← MVEE({θ ∈ Θ_t : (θ − c_t)⊤(Z̄* − Z̄) ≥ 0}).
12:      i ← 0, Z̄ ← 0, Z̄* ← 0.
13:    end if
14:  end if
15: end for

Unlike in traditional IRL, in our setting the agent is also acting, which gives rise to many subtleties. First, the total reward on the agent's single trajectory is a random variable, and may deviate from the expected value of its policy.
Therefore, it is generally impossible to decide whether the agent's policy is near-optimal; instead, we assume that the human can check whether each action the agent takes in the trajectory is near-optimal: when the agent takes a at state s, an error is counted if and only if Q*(s, a) < V*(s) − ε. This criterion can be viewed as a noisy version of the one used in previous sections, as taking the expectation of V*(s) − Q*(s, π(s)) over the occupancy induced by π recovers Equation 2.

While this resolves the issue on the agent's side, how should the human provide his/her optimal trajectory? The most straightforward protocol is for the human to roll out a trajectory from the initial distribution of the task, μ_t. We argue that this is not a reasonable protocol, for two reasons: (1) in expectation, the reward collected by the human may be less than that collected by the agent, because conditioning on the event that an error is spotted may introduce a selection bias; (2) the human may not encounter the problematic state in his/her own trajectory, in which case the information provided by the trajectory may be irrelevant.

To resolve this issue, we consider a different protocol in which the human rolls out a trajectory using an optimal policy from the very state where the agent errs.

Now we discuss how to prove a parallel of Theorem 2 under this new protocol. First, let us assume that the demonstration were still given in the form of a state occupancy vector starting at the problematic state.
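The per-action check Q*(s, a) < V*(s) − ε is straightforward to implement for a small tabular MDP. A hedged sketch (state-based rewards, a (A, S, S) transition tensor, and a fixed number of value-iteration sweeps are simplifying assumptions of ours):

```python
import numpy as np

def optimal_values(P, r, gamma, iters=1000):
    """Value iteration for a tabular MDP.
    P: (A, S, S) transition tensor, r: (S,) state reward, gamma: discount.
    Returns Q* of shape (A, S) and V* of shape (S,)."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)  # Q[a, s] = r(s) + gamma * E[V(s') | s, a]
        V = Q.max(axis=0)
    return Q, V

def is_error(Q, V, s, a, eps):
    """The human's per-action check: flag action a at state s
    iff Q*(s, a) < V*(s) - eps."""
    return bool(Q[a, s] < V[s] - eps)
```

For instance, in a two-state chain where only state 1 is rewarding, staying in state 0 is flagged while moving toward state 1 is not.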
In this case, we can reduce to the setting of Section 6 by changing μ_t to a point mass on the problematic state.^5 To apply the algorithm and the analysis of Section 6, it remains to show that the notion of error in this section (a suboptimal action) implies the notion of error in Section 6 (a suboptimal policy): letting s be the problematic state and π the agent's policy, we have V^π(s) = Q^π(s, π(s)) ≤ Q*(s, π(s)) < V*(s) − ε. So whenever a suboptimal action is spotted in state s, it indeed implies that the agent's policy is suboptimal for s as the initial state. Hence, we can run Algorithm 1 as-is and Theorem 2 immediately applies.

To tackle the remaining issue that the demonstration is given as a single trajectory, we will not update Θ_t after each mistake as in Algorithm 1, but only make an update after every mini-batch of mistakes, aggregating them to form accurate update rules. See Algorithm 2. The formal guarantee of the algorithm is stated in Theorem 5, whose proof is deferred to Appendix G.

^5 At first glance this might seem suspicious: the problematic state is random and depends on the learner's current policy, whereas in RL the initial distribution is usually fixed and the learner has no control over it. This concern is removed thanks to our adversarial setup on (E_t, R_t) (of which μ_t is a component).

Theorem 5.
For any δ ∈ (0, 1), with probability at least 1 − δ, the number of mistakes made by Algorithm 2 with parameters Θ_0 = [−1, 1]^d, H = ⌈log(12/ε)/(1 − γ)⌉, and n = ⌈(32/ε²) log(4d(d + 1) log(6√d/ε)/δ)⌉, where d = |S|,^6 is at most Õ((d²/ε²) log(d/(δε))).^7

8 Related work & Conclusions

Most existing work in IRL has focused on inferring the reward function^8 using data acquired from a fixed environment [2, 3, 18, 19, 20, 21, 22]. There is prior work on using data collected from multiple (but exogenously fixed) environments to predict agent behavior [23]. There are also applications where methods for single-environment MDPs have been adapted to multiple environments [19]. Nevertheless, all of these works consider the objective of mimicking optimal behavior in the presented environment(s), and do not aim at the generalization to new tasks that is the main contribution of this paper. Recently, Hadfield-Menell et al. [24] proposed cooperative inverse reinforcement learning, where the human and the agent act in the same environment, allowing the human to actively resolve the agent's uncertainty about the reward function. However, they consider only a single environment (or task), so the unidentifiability issue of IRL still exists. Combining their interesting framework with our resolution of unidentifiability (via multiple tasks) is an interesting future direction.

Acknowledgement

This work was supported in part by NSF grant IIS 1319365 (Singh & Jiang) and in part by a Rackham Predoctoral Fellowship from the University of Michigan (Jiang). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.

References

[1] Kareem Amin and Satinder Singh.
Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.

[2] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.

[3] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, page 1. ACM, 2004.

[4] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. Advances in Neural Information Processing Systems, 19:1, 2007.

[5] Nick Bostrom. Ethical issues in advanced artificial intelligence. Science Fiction and Philosophy: From Time Travel to Superintelligence, pages 277–284, 2003.

[6] Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.

[7] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[8] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

^6 Here we use the simpler conversion explained right after Example 1. We can certainly improve the dimension to d = |S| − 1 by dropping the s_ref coordinate in all relevant vectors, but that complicates presentation.
^7 A log log(1/ε) term is suppressed in Õ(·).
^8 While we do not discuss it here, in the economics literature the problem of inferring an agent's utility from behavior queries has long been studied under the heading of utility or preference elicitation [13, 14, 15, 16, 17]. While our result in Section 5 uses similar techniques to elicit the reward function, we do so purely by observing the human's behavior, without an external source of information (e.g., query responses).

[9] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.

[10] Ryan O'Donnell. 15-859(E) – linear and semidefinite programming: lecture notes. Carnegie Mellon University, 2011. https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture08.pdf.

[11] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer Science & Business Media, 2012.

[12] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

[13] Urszula Chajewska, Daphne Koller, and Ronald Parr. Making rational decisions using adaptive utility elicitation. In AAAI/IAAI, pages 363–369, 2000.

[14] John Von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press, 2007.

[15] Kevin Regan and Craig Boutilier. Regret-based reward elicitation for Markov decision processes. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 444–451. AUAI Press, 2009.

[16] Kevin Regan and Craig Boutilier. Eliciting additive reward functions for Markov decision processes.
In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 2159, 2011.

[17] Constantin A Rothkopf and Christos Dimitrakakis. Preference elicitation and inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases, pages 34–48. Springer, 2011.

[18] Adam Coates, Pieter Abbeel, and Andrew Y Ng. Learning for control from multiple demonstrations. In Proceedings of the 25th International Conference on Machine Learning, pages 144–151. ACM, 2008.

[19] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.

[20] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51:61801, 2007.

[21] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pages 1449–1456, 2007.

[22] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In AAAI, 2010.

[23] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pages 729–736. ACM, 2006.

[24] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.