{"title": "Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making", "book": "Advances in Neural Information Processing Systems", "page_first": 4712, "page_last": 4720, "abstract": "It is commonly believed that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto optimal policy, i.e. a policy that cannot be improved upon for one principal without making sacrifices for another. Harsanyi's theorem shows that when the principals have a common prior on the outcome distributions of all policies, a Pareto optimal policy for the agent is one that maximizes a fixed, weighted linear combination of the principals\u2019 utilities. In this paper, we derive a more precise generalization for the sequential decision setting in the case of principals with different priors on the dynamics of the environment. We refer to this generalization as the Negotiable Reinforcement Learning (NRL) framework. In this more general case, the relative weight given to each principal\u2019s utility should evolve over time according to how well the agent\u2019s observations conform with that principal\u2019s prior. To gain insight into the dynamics of this new framework, we implement a simple NRL agent and empirically examine its behavior in a simple environment.", "full_text": "Negotiable Reinforcement Learning for Pareto\n\nOptimal Sequential Decision-Making\n\nNishant Desai\n\nCenter for Human-Compatible AI\nUniversity of California, Berkeley\nnishantdesai@berkeley.edu\n\nAndrew Critch\n\nDepartment of EECS\n\nUniversity of California, Berkeley\n\ncritch@berkeley.edu\n\nStuart Russell\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nrussell@cs.berkeley.edu\n\nAbstract\n\nIt is commonly believed that an agent making decisions on behalf of two or more\nprincipals who have different utility functions should adopt a Pareto optimal policy,\ni.e. 
a policy that cannot be improved upon for one principal without making\nsacri\ufb01ces for another. Harsanyi\u2019s theorem shows that when the principals have a\ncommon prior on the outcome distributions of all policies, a Pareto optimal policy\nfor the agent is one that maximizes a \ufb01xed, weighted linear combination of the\nprincipals\u2019 utilities. In this paper, we derive a more precise generalization for the\nsequential decision setting in the case of principals with different priors on the\ndynamics of the environment. We refer to this generalization as the Negotiable\nReinforcement Learning (NRL) framework. In this more general case, the relative\nweight given to each principal\u2019s utility should evolve over time according to how\nwell the agent\u2019s observations conform with that principal\u2019s prior. To gain insight\ninto the dynamics of this new framework, we implement a simple NRL agent and\nempirically examine its behavior in a simple environment.\n\n1\n\nIntroduction\n\nIt has been argued that the \ufb01rst AI systems with generally super-human cognitive abilities will play a\npivotal decision-making role in directing the future of civilization [Bostrom, 2014]. If that is the case,\nan important question will arise: Whose values will the \ufb01rst super-human AI systems serve? Since\nsafety is a crucial consideration in developing such systems, assuming the institutions building them\ncome to understand the risks and the time investments needed to address them [Baum, 2016], they\nwill have a large incentive to cooperate in their design rather than racing under time-pressure to build\ncompeting systems [Armstrong et al., 2016].\nTherefore, consider two nations\u2014allies or adversaries\u2014who must decide whether to cooperate in the\ndeployment of an extremely powerful AI system. Implicitly or explicitly, the resulting system would\nhave to strike compromises when con\ufb02icts arise between the wishes of those nations. 
How can they\nspecify the degree to which that system would be governed by the distinctly held principles of each\nnation? More mundanely, suppose a couple purchases a domestic robot. How should the robot strike\ncompromises when con\ufb02icts arise between the commands of its owners?\nIt is already an interesting and dif\ufb01cult problem to robustly align an AI system\u2019s values with those\nof a single human (or a group of humans in close agreement). Inverse reinforcement learning (IRL)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f[Russell, 1998] [Ng and Russell, 2000] [Abbeel and Ng, 2004] and cooperative inverse reinforcement\nlearning (CIRL) [Had\ufb01eld-Menell et al., 2016] represent successively realistic early approaches to\nthis problem. But supposing some adequate solution eventually exists for aligning the values of a\nmachine intelligence with a single human decision-making unit, how should the values of a system\nserving multiple decision-makers be \u201caligned\u201d?\nIn the present work, we attempt to begin answering this question. We begin by observing some\nde\ufb01ciencies of optimizing a \ufb01xed, linear weighted sum of principals\u2019 utilities, as prescribed by\nHarsanyi\u2019s social aggregation theorem, in the case that those principals have differing beliefs about\nthe probability distributions dictating the agent\u2019s observations. We show that building a decision\nagent whose policy optimizes such an objective is not, in general, ex-ante Pareto optimal (i.e. \"as\nevaluated before the agent has taken any actions\"). 
Intuitively, linear weighted aggregation fails\nbecause, before the jointly owned agent has taken any actions, each principal evaluates its policy with\nrespect to their own beliefs, meaning that a policy that selectively prefers one principal over the other\nconditioned on its observations can be desirable to both principals.\nSection 3 addresses the shortcomings of Harsanyi-style preference aggregation by presenting the\nNegotiable Reinforcement Learning (NRL) framework. In this domain, we model each principal\u2019s\nprior on the environment and utility function as a Partially Observable Markov Decision Process\n(POMDP). We place necessary and suf\ufb01cient conditions on Pareto optimality for an agent acting over\nthese POMDPs with policy \u03c0. We then construct a third POMDP and show that the optimal policy for\nthis single POMDP satis\ufb01es the conditions for Pareto optimality. We refer to an agent implementing\na policy that solves this reduced POMDP as a NRL agent.\nFollowing directly from this reduction is the intriguing property that a Pareto optimal policy must,\nover time, prefer the utility function of the principal whose beliefs are a better predictor of the agent\u2019s\nobservations. This counter-intuitive result constitutes the main theorem of this paper. This can be\nseen as settling a kind of bet between the two principals: whichever principal makes the correct\npredictions gets to have their utility prioritized. In Section 4, we implement a simple NRL agent and\nmake empirical observations of this bet settling behavior.\n\n2 Related work\n\nSocial choice theory. The entirety of social choice theory and voting theory may be viewed as\nan attempt to specify an agreeable formal policy to enact on behalf of a group. Harsanyi\u2019s utility\naggregation theorem [Harsanyi, 1980] suggests one form of solution: maximizing a linear combination\nof group members\u2019 utility functions. 
The present work shows that this solution is inappropriate when\nprincipals have different beliefs, and Theorem 4 may be viewed as an extension of Harsanyi\u2019s form\nthat accounts simultaneously for differing priors and the prospect of future observations. Indeed,\nHarsanyi\u2019s form follows as a direct corollary of Theorem 4 when principals do share the same beliefs.\n\nMulti-agent systems. Zhang and Shah [2014] may be considered a sequential decision-making\napproach to social choice: they use MDPs to represent the decisions of principals in a competitive\ngame, and exhibit an algorithm for the principals that, if followed, arrives at a Pareto optimal Nash\nequilibrium satisfying a certain fairness criterion. Among the literature surveyed here, that paper\nis the closest to the present work in terms of its intended application: roughly speaking, achieving\nmutually desirable outcomes via sequential decision-making. However, that work is concerned with\nan ongoing interaction between the principals, rather than selecting a policy for a single agent to\nfollow as in the present paper.\n\nMulti-objective sequential decision-making. There is also a good deal of work on\nMulti-Objective Optimization (MOO) [Tzeng and Huang, 2011], including for sequential decision-making,\nwhere solution methods have been called Multi-Objective Reinforcement Learning (MORL). For\ninstance, G\u00e1bor et al. [1998] introduce a MORL method called Pareto Q-learning for learning a set\nof Pareto optimal policies for a Multi-Objective MDP (MOMDP). Soh and Demiris [2011] define\nMulti-Reward Partially Observable Markov Decision Processes (MR-POMDPs), and use genetic\nalgorithms to produce non-dominated sets of policies for them. Roijers et al. [2015] refer to the same\nproblems as Multi-objective POMDPs (MOPOMDPs), and provide a bounded approximation method\nfor the optimal solution set for all possible weightings of the objectives. 
Wang [2014] surveys MORL\nmethods, and contributes Multi-Objective Monte-Carlo Tree Search (MOMCTS) for discovering\nmultiple Pareto optimal solutions to a multi-objective optimization problem.\nHowever, none of these or related works address scenarios where the objectives are derived from\nprincipals with differing beliefs, from which the priority-shifting phenomenon of Theorem 4 arises.\nDiffering beliefs are likely to play a key role in negotiations, so for that purpose, the formulation of\nmulti-objective decision-making adopted here is preferable.\n\n3 Negotiable Reinforcement Learning\n\nConsider, informally, a scenario wherein two principals (perhaps individuals, companies, or states)\nare considering cooperating to build or otherwise obtain a machine that will then interact with an\nenvironment on their behalf.1 In such a scenario, the principals will tend to bargain for \u201chow much\u201d\nthe machine will prioritize their separate interests, so to begin, we need some way to quantify \u201chow\nmuch\u201d each principal is prioritized.\nFor instance, one might model the machine as maximizing the expected value, given its observations,\nof some utility function U of the environment that equals a weighted sum\n\nw^{(1)} U^{(1)} + w^{(2)} U^{(2)} \\qquad (1)\n\nof the principals\u2019 individual utility functions U (1) and U (2), as Harsanyi\u2019s social aggregation theorem\nrecommends [Harsanyi, 1980]. Then the bargaining process could focus on choosing the values of\nthe weights w(i).\nHowever, this turns out to be a bad idea. As we shall see in the following example, this solution form\nis not generally compatible with Pareto optimality when agents have different beliefs. Harsanyi\u2019s\nsetting does not account for agents having different priors, nor for decisions being made sequentially,\nafter future observations. In such a setting, we need a new form of solution, exhibited here.\n\nA cake-splitting scenario. 
Alice (principal 1) and Bob (principal 2) are about to be presented with\na cake which they can choose to split in half to share, or give entirely to one of them. They have\n(built or purchased) a robot that will make the cake-splitting decision on their behalf. Alice\u2019s utility\nfunction returns 0 if she gets no cake, 20 if she gets half a cake, or 30 if she gets a whole cake. Bob\u2019s\nutility function works similarly.\nHowever, Alice and Bob have slightly different beliefs about how the environment works. They both\nagree on the state of the environment that the robot will encounter at \ufb01rst: a room with a cake in\nit (s1 = \u201ccake\u201d). But Alice and Bob have different predictions about how the robot\u2019s sensors will\nperceive the cake: Alice thinks that when the robot perceives the cake, it is 90% likely to appear\nwith a red tint (o1 = \u201cred\"), and 10% likely to appear with a green tint (o1 = \u201cgreen\"), whereas Bob\nbelieves the exact opposite. In either case, upon seeing the cake, the robot will either give Alice the\nentire cake (a1 = (all, none)), split the cake half-and-half (a1 = (half, half)), or give Bob the entire\ncake (a1 = (none, all)). Moreover, Alice and Bob have common knowledge of all these facts.\nNow, consider the following Pareto optimal policy that favors Alice (principal 1) when o1 is red, and\nBob (principal 2) when o1 is green:\n\n\u02c6\u03c0(\u2212 | red) = 100%(all, none)\n\u02c6\u03c0(\u2212 | green) = 100%(none, all)\n\nThis policy can be viewed intuitively as a bet between Alice and Bob about the value of o1, and is\nhighly appealing to both principals:\n\nE(1)[U (1); \u02c6\u03c0] = 90%(30) + 10%(0) = 27\nE(2)[U (2); \u02c6\u03c0] = 10%(0) + 90%(30) = 27\n\nIn particular, \u02c6\u03c0 is more appealing to both Alice and Bob than an agreement to deterministically split\nthe cake (half, half). 
We start to see that when principals evaluate their expected returns under a\npolicy \u03c0 with respect to differing beliefs about action outcomes, they may mutually agree to a policy\nthat favors one principal over the other, contingent on the action-observation history. We formalize\nthis intuition and explore its consequences in the following sections.\n\n1The results here all generalize from two principals to n principals being combined successively in any order,\n\nbut for clarity of exposition, the two person case is prioritized.\n\n3\n\n\f3.1 A POMDP formulation\n\nLet us formalize the machine\u2019s decision-making situation using the structure of a Partially Observable\nMarkov Decision Process (POMDP) [Sondik, 1973]. It is assumed that the principals will have\ncommon knowledge of the policy \u03c0 = (\u03c01, . . . , \u03c0n) they select for the machine to implement, but that\nthe principals may have different beliefs about how the environment works, and of course different\nutility functions. We refer to this as the common knowledge assumption.\nWe encode each principal j\u2019s outlook as a POMDP, D(j) = (S (j),A, T (j), U (j),O, \u2126(j), n), which\nsimultaneously represents that principal\u2019s beliefs about the environment, and the principal\u2019s utility\nfunction.\nS (j) is a set of possible states s of the environment.\nA is the set of possible actions a available to the NRL agent.\nT (j) is the conditional probabilities principal j believes will govern the environment state transitions,\n\ni.e., P(j)(si+1 | si, ai).\n\nU (j) is principal j\u2019s utility function from sequences of environmental states (s1, . . . , sn) to R. 
2\nO is the set of possible observations o of the NRL agent.\n\u2126(j) is the conditional probabilities principal j believes will govern the agent\u2019s observations, i.e.,\n\nP(j)(oi | si).\nn is the time horizon.\nThus, principal j\u2019s subjective probability of an outcome (\u00afs, \u00afo, \u00afa), for any \u00afs \u2208 (S (j))n, is given by a\nprobability distribution P(j) that takes \u03c0 as a parameter:\n\nP^{(j)}(\\bar{s}, \\bar{o}, \\bar{a}; \\pi) := P^{(j)}(s_1) \\cdot \\prod_{i=1}^{n} P^{(j)}(o_i \\mid s_i) \\, \\pi(a_i \\mid o_{\\le i} a_{<i}) \\, P^{(j)}(s_{i+1} \\mid s_i, a_i) \\qquad (2)\n\nWe say that the POMDPs D(1) and D(2) are compatible if any policy for one may be viewed as\na policy for the other, i.e., they have the same set of actions A and observations O, and the same\nnumber of time steps n. By the common knowledge assumption, principals\u2019 outlooks are assumed to\nbe encoded by a set of compatible POMDPs.\n\n3.2 Pareto optimal policies\n\nIn this context, where a policy \u03c0 may be evaluated relative to more than one POMDP, we use\nsuperscripts to represent which POMDP is governing the probabilities and expectations, e.g.,\n\nE^{(j)}_{\\pi}[U^{(j)}] := \\sum_{\\bar{s} \\in (S^{(j)})^n} P^{(j)}(\\bar{s}; \\pi) \\, U^{(j)}(\\bar{s})\n\nrepresents the expectation in D(j) of the utility function U (j), assuming policy \u03c0 is followed.\nDefinition 1 (Pareto optimal policies). 
A policy \u03c0 is Pareto optimal for a compatible pair of POMDPs\n(D(1), D(2)) if for any other policy \u03c0\u2032, either\n\nE^{(1)}_{\\pi}[U^{(1)}] \\ge E^{(1)}_{\\pi'}[U^{(1)}] \\quad \\text{or} \\quad E^{(2)}_{\\pi}[U^{(2)}] \\ge E^{(2)}_{\\pi'}[U^{(2)}].\n\nIt is assumed that, during negotiation, the principals will be seeking a Pareto optimal policy for the\nagent to follow, relative to the POMDPs D(1) and D(2) describing each principal\u2019s outlook.\nIf we allow the policy \u03c0 to come from the space of stochastic policies, mapping action-observation\nhistories to distributions over actions:\n\n\\pi : (\\bar{A}, \\bar{O}) \\mapsto \\Delta A,\n\nthen we can show that the following condition is necessary and sufficient for \u03c0 to be Pareto optimal.\n\n2For the sake of generality, U (j) is not assumed to be stationary, as reward functions often are.\n\nLemma 2. A policy \u03c0\u2217 is Pareto optimal for principals 1 and 2 if and only if there exist weights\nw(1), w(2) \u2265 0 with w(1) + w(2) = 1 such that\n\n\\pi^* \\in \\operatorname{argmax}_{\\pi \\in \\Pi} \\left( w^{(1)} E^{(1)}_{\\pi}[U^{(1)}] + w^{(2)} E^{(2)}_{\\pi}[U^{(2)}] \\right) \\qquad (3)\n\nProof. See supplementary material.\n\n3.3 Reduction to a single POMDP\n\nWe shall soon see that any Pareto optimal policy \u03c0 must favor, as time progresses, optimizing the utility\nof whichever principal\u2019s beliefs were a better predictor of the NRL agent\u2019s inputs. This phenomenon\nis most easily shown via a reduction of the POMDPs describing the outlooks of principals 1 and 2 to\na third POMDP, as follows.\nFor any weights w(1), w(2) \u2265 0 with w(1) + w(2) = 1, we define a new POMDP that works by\nflipping a (w(1), w(2))-weighted coin, and then running D(1) or D(2) thereafter, according to the\ncoin flip. Explicitly, we define a POMDP mixture as a POMDP D = w(1)D(1) + w(2)D(2). 
We give\nD the same action space A and observation space O as the compatible POMDPs D(1) and D(2). In\nthe state space of D, we include a latent, stochastic, binary variable B \u2208 {1, 2} that is fixed for all\ntime, with initial belief given by P(B = 1) = w(1) and P(B = 2) = w(2). In the case that B = 1, we\ndraw states, transition probabilities, observation probabilities, and utilities from the parameters of D(1).\nFormally,\n\nP^{(D)}(\\bar{s}, \\bar{o}, \\bar{a} \\mid B = 1; \\pi) = P^{(1)}(\\bar{s}, \\bar{o}, \\bar{a}; \\pi),\n\nand\n\nE^{(D)}_{\\pi}[U \\mid B = 1] = E^{(1)}_{\\pi}[U^{(1)}].\n\nLikewise for B = 2. This is a generalized version of a well-known POMDP reduction used in the\nBayesian Reinforcement Learning literature [Ghavamzadeh et al., 2016].\nGiven any policy \u03c0, the expected payoff of \u03c0 in D = w(1)D(1) + w(2)D(2) is exactly\n\nP(B = 1) \\cdot E^{(D)}_{\\pi}[U \\mid B = 1] + P(B = 2) \\cdot E^{(D)}_{\\pi}[U \\mid B = 2] \\qquad (4)\n= w^{(1)} E^{(1)}_{\\pi}[U^{(1)}] + w^{(2)} E^{(2)}_{\\pi}[U^{(2)}] \\qquad (5)\n\nTherefore, using the above definitions, Lemma 2 may be restated in the following equivalent form:\nLemma 3. Given a pair (D(1), D(2)) of compatible POMDPs, a policy \u03c0 is Pareto optimal for that\npair if and only if there exist weights w(j) such that \u03c0 is an optimal policy for the single POMDP\ngiven by w(1)D(1) + w(2)D(2).\n\n3.4 Structural properties of Pareto optimal POMDP reduction\n\nExpressed in the form of Equation 3, it might not be clear how a Pareto optimal policy makes use of its\nobservations over time, aside from storing them in memory. For example, is there any sense in which\nthe NRL agent carries \u201cbeliefs\u201d about the environment that it \u201cupdates\u201d at each time step? Lemma 3\nallows us to answer this and related questions by translating theorems about single POMDPs into\ntheorems about compatible pairs of POMDPs.\nWe introduce the notation hi to represent the NRL agent\u2019s action-observation history at time i, i.e.\nhi = (o\u2264i, a<i). 
At timestep i, the NRL agent has a belief over the latent value B, conditioned on hi.\nNote that given an action-observation history hi, the NRL agent\u2019s posterior belief over the value of B\nis proportional to the probabilities assigned by each principal\u2019s outlook to the realized observation\nsequence:\n\nP(B = j|hi) \u221d w(j)P(j)(o\u2264i | a<i).\n\nThis relation, combined with the expected value expression in Equation 4, reveals a pattern in how\nthe weights on the principals\u2019 conditionally expected utilities must change over time, which is the\nmain result of this paper:\n\nTheorem 4 (Pareto optimal policy recursion). Given a pair (D(1), D(2)) of compatible POMDPs of\nlength n, a policy \u03c0 is Pareto optimal if and only if its components \u03c0i for i \u2264 n satisfy the following\nbackward recursion for some pair of weights w(1), w(2) \u2265 0 with w(1) + w(2) = 1:\n\n\\pi(h_i) \\in \\operatorname{argmax}_{\\alpha \\in \\Delta A} \\left( w^{(1)} P^{(1)}(o_{\\le i} \\mid a_{<i}) \\, E^{(1)}_{\\pi}[U^{(1)} \\mid h_i, a_i \\sim \\alpha] + w^{(2)} P^{(2)}(o_{\\le i} \\mid a_{<i}) \\, E^{(2)}_{\\pi}[U^{(2)} \\mid h_i, a_i \\sim \\alpha] \\right)\n\nIn words, to achieve Pareto optimality, the machine must\n\n1. use each principal\u2019s own beliefs when estimating the degree to which a decision favors that\nprincipal\u2019s utility function, and\n\n2. shift the relative priorities of the principals\u2019 expected utilities in the machine\u2019s decision\nobjective over time, by a factor proportional to how well the principals predict the machine\u2019s\ninputs.\n\nProof. By Lemma 3, the Pareto-optimality of \u03c0 for (D(1), D(2)) is equivalent to its classical\noptimality for D = w(1)D(1) + w(2)D(2) for some (w(1), w(2)). Writing P for probabilities in D, this is\nequivalent to \u03b1 = \u03c0(hi) maximizing the following expression F (\u03b1) for each i \u2208 {0, . . . 
, n}:\n\nF(\\alpha) = E^{(D)}_{\\pi}[U \\mid h_i, a_i \\sim \\alpha]. \\qquad (6)\n\nThe above property states, in words, that the optimal policy is the one that maximizes future expected\nrewards given an observation history, without regard to any alternate histories and by recursively\nassuming the same selection process for future timesteps. This is a standard formulation for POMDPs,\nand is exactly Bellman\u2019s Principle of Optimality [Bellman, 1957, Chap. III, 3].\nThe tower property of expectation allows us to write the above expectation as\n\nE^{(D)}_{\\pi}[U \\mid h_i, a_i \\sim \\alpha] = P(B = 1 \\mid h_i) \\, E^{(D)}_{\\pi}[U \\mid h_i, a_i \\sim \\alpha, B = 1] + P(B = 2 \\mid h_i) \\, E^{(D)}_{\\pi}[U \\mid h_i, a_i \\sim \\alpha, B = 2]. \\qquad (7)\n\nNow, observe that, by Bayes\u2019 rule, the posterior probability satisfies\n\nP(B = j \\mid h_i) \\propto w^{(j)} P^{(j)}(o_{\\le i} \\mid a_{<i}). \\qquad (8)\n\nBy construction, E(D)\u03c0[U | hi, ai \u223c \u03b1, B = j] = E(j)\u03c0[U (j) | hi, ai \u223c \u03b1]. Substituting this\nexpression and (8) into Equation 7 gives us the result.\n\nAn intuition about this property is gained by noting that as the NRL agent takes actions in the single\nPOMDP w(1)D(1) + w(2)D(2), its posterior belief about the value of the latent variable B is exactly\nequal to its belief about which utility function is \u201ccorrect\u201d for the POMDP it is acting in.\nWhen the principals have the same beliefs, they always assign the same probability to the machine\u2019s\ninputs, so the weights on their respective expectations do not change over time. In this case, Harsanyi\u2019s\nutility aggregation formula is recovered as a special instance.\n\n3.5 Interpretation as Bet Settling\n\nTheorem 4 shows that a Pareto optimal policy must tend, over time, toward prioritizing the expected\nutility of whichever principal\u2019s beliefs best predict the machine\u2019s inputs. 
From some perspectives,\nthis is a little counterintuitive: not only must the machine gradually place more predictive weight\non whichever principal\u2019s prior is a better predictor, but it must reward that principal by attending\nmore to their utility function as well.\nThus, a machine implementing a Pareto optimal policy can be viewed as a kind of bet-settling device.\nIf Alice is 90% sure the Four Horsemen will appear tomorrow and Bob is 80% sure they won\u2019t, it\nmakes sense for Alice to ask, while bargaining with Bob for the machine\u2019s policy, that the machine\nprioritize her values more if the Four Horsemen arrive tomorrow, in exchange for prioritizing Bob\u2019s\nvalues more if they don\u2019t. Both parties will be happy with this agreement in expectation.\n\nFigure 1: The agent initially heads for Goal 2 in the top-right corner. However, at Frame 9 it has\nobserved eight deterministic transitions in a row. Its confidence in MDP 1 is high enough at that point\nthat it veers downward and heads to Goal 1. Shown: Frames 0, 6, 12, and 15 of the trajectory.\n\n4 Evaluation\n\n4.1 Experiment Environment\n\nOur experiments are run in a modified version of the FrozenLake environment in OpenAI Gym\n[Brockman et al., 2016]. FrozenLake is a grid world environment that simulates a goal MDP. The\nagent receives a reward upon reaching the specified goal position. The agent can choose to move\nNORTH, SOUTH, EAST, or WEST, and the transition model can be chosen to be either stochastic or\ndeterministic. Under a stochastic transition model, simulating the eponymous frozen lake, the agent\u2019s\naction fails with probability 0.2, and the agent transitions into one of the unintended neighboring\npositions. In the FrozenLake environment, state is fully observed by the agent. We modify the\nenvironment to support multiple possible goal states, labeled 1 and 2, corresponding to the utilities of\nPrincipals 1 and 2, respectively. 
For simplicity, we treat the state as fully observable.\nIn these experiments Principal 1 assigns utility to the agent reaching each goal labeled 1 and has the\nbelief that the environment has a deterministic transition model. Principal 2 assigns utility to the\nagent reaching each goal labeled 2 and has the belief that the environment has a stochastic transition\nmodel. The agent is initialized with initial belief state w1, corresponding to a subjective belief that the\nagent is in Principal 1\u2019s MDP, M1, with probability w1 and Principal 2\u2019s MDP, M2, with probability\n1 \u2212 w1 = w2. We use point-based value iteration to learn the belief-space policy. The agent is then\nplaced in either M1 or M2, and we observe over time as the agent acts according to its belief state.\n\n4.2 Experiments\n\nObserved Behavior. In this set of experiments, we run the NRL agent in order to verify that its\nbehavior resembles a type of bet-settling. The true environment is chosen to be deterministic for this\nexperiment. After running point-based value iteration [Pineau et al., 2003] with a belief set of 331\npoints, we execute the resulting policy in this environment. A portion of the agent\u2019s trajectory is\nseen in Figure 1. The purple arrows represent the agent\u2019s choice of action at each physical position\nunder the current belief state. The color of each square represents the agent\u2019s subjective value at each\nposition under the current belief.\nObserving the trajectory, we see that the agent initially moves towards goal 2. However, each time an action\nsucceeds, the agent\u2019s belief in the stochastic environment decreases. By the ninth frame, the agent\u2019s\nbelief in the stochastic world, and as a result its belief that goal 2 grants reward, is low enough that\nthe policy shifts to push the agent to goal 1. 
This is the type of behavior we would expect of an agent\nthat maximizes each principal\u2019s utility based on the likelihood of their beliefs.\nNext, we use the same policy and place the agent in a stochastic world. The very first action results in\na stochastic transition, and the likelihood of Principal 1\u2019s belief immediately falls to 0. At that point,\nthe agent knows it is in Principal 2\u2019s MDP and believes that goal 2 will give it reward. It heads to\ngoal 2 accordingly.\n\nFigure 2: Left: Principal 1\u2019s predicted expected reward while in MDP 2. Right: Principal 2\u2019s predicted\nexpected reward during the trajectory depicted in MDP 1.\n\nFull trajectories for both of these experiments are presented in the supplementary material.\n\nSubjective State Value During Execution. The promise of compromise behavior may bring parties\ninto agreement over the decision to build an NRL agent. However, if, after the agent begins taking\nactions, parties feel that the agent will not take actions that guarantee them high future rewards, those\nparties may be tempted to end cooperation. The next experiment attempts to test how losing parties\nassess the agent\u2019s behavior during execution.\nTo address this concern, we turn back to the trajectories discussed in the previous section. In the\nfirst trajectory, the agent is, in fact, operating in Principal 1\u2019s MDP. At each timestep, we ask what\nPrincipal 2 believes their expected sum of future rewards to be. Recall that each principal believes\nthe agent is acting in their own MDP. Since both principals know the agent\u2019s policy, physical state,\nand belief state at each timestep, each principal can estimate their future rewards by simulating the\nagent acting in their respective MDP with initial configuration given by the current configuration.\nFor the experiments here, we roll out 100 simulations from each configuration encountered along\nthe two trajectories. 
For each simulation, we place the agent within the losing principal\u2019s MDP and\nmeasure the average reward attained by the agent from that starting con\ufb01guration. Principal 2\u2019s\nassessment of their expected future rewards during execution in MDP 1 is shown in the right frame of\nFigure 2. Principal 1\u2019s assessment of their expected future rewards during execution in MDP 2 is\nshown in the left frame.\nWe see that until frame 9, Principal 2 believes that the agent has a high chance of reaching goal 2. It\nis only when the agent\u2019s behavior shifts and it veers towards goal 1 that Principal 2 begins to assess\ntheir expected future utility as decreasing sharply. In contrast, as soon as Principal 1 observes that\nw1 = 0.0, they realize that the agent will never reach goal 1. Both of these assessments are consistent\nwith the bet-settling behavior we expect.\n\n5 Conclusion\n\nInsofar as Theorem 4 is not particularly mathematically sophisticated\u2014it employs only basic facts\nabout convexity and linear algebra\u2014this suggests there may be more low-hanging fruit to be found in\nthe domain of \u201cmachine implementable social choice theory.\" To recapitulate, Theorem 4 represents\ntwo deviations from the intuition of na\u00efve utility aggregation: to achieve Pareto optimality for\nprincipals with differing beliefs, an agent must (1) use each principal\u2019s own beliefs in evaluating how\nwell an action will serve that principal\u2019s utility function, and (2) shift the relative priority it assigns\nto each principal\u2019s expected utilities over time by a factor proportional to how well that principal\u2019s\nbeliefs predict the machine\u2019s inputs.\nAs a \ufb01nal remark, consider that social choice theory and bargaining theory were both pioneered during\nthe Cold War, when it was particularly compelling to understand the potential for cooperation between\nhuman institutions that might behave competitively. 
In the coming decades, machine intelligence\nwill likely bring many new challenges for cooperation, as well as new means to cooperate, and new\nreasons to do so. As such, new technical aspects of social choice and bargaining, along the lines of\nthis paper, will likely continue to emerge.\n\n8\n\n\fReferences\nPieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In\nProceedings of the twenty-\ufb01rst international conference on Machine learning, page 1. ACM, 2004.\n\nStuart Armstrong, Nick Bostrom, and Carl Shulman. Racing to the Precipice: A Model of Arti\ufb01cial\n\nIntelligence Development. AI & SOCIETY, 31(2):201\u2013206, 2016.\n\nSeth D Baum. On the Promotion of Safe and Socially Bene\ufb01cial Arti\ufb01cial Intelligence. AI &\n\nSOCIETY, pages 1\u20139, 2016.\n\nRichard Bellman. Dynamic Programming. Princeton University Press, 1957.\n\nNick Bostrom. Superintelligence: Paths, dangers, strategies. OUP Oxford, 2014.\n\nGreg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and\n\nWojciech Zaremba. OpenAI Gym, 2016.\n\nZolt\u00e1n G\u00e1bor, Zsolt Kalm\u00e1r, and Csaba Szepesv\u00e1ri. Multi-Criteria Reinforcement Learning. In ICML,\n\nvolume 98, pages 197\u2013205, 1998.\n\nMohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian Reinforcement\n\nLearning: A Survey. ArXiv e-prints, September 2016.\n\nDylan Had\ufb01eld-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative Inverse\n\nReinforcement Learning, 2016.\n\nJohn C Harsanyi. Cardinal Welfare, Individualistic Ethics, and Interpersonal Comparisons of Utility.\n\nIn Essays on Ethics, Social Behavior, and Scienti\ufb01c Explanation, pages 6\u201323. Springer, 1980.\n\nAndrew Y Ng and Stuart J Russell. Algorithms for Inverse Reinforcement Learning. In ICML, pages\n\n663\u2013670, 2000.\n\nJoelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based Value Iteration: An Anytime\n\nAlgorithm for POMDPs. 
IJCAI\u201903, pages 1025\u20131030, 2003.\n\nDiederik M Roijers, Shimon Whiteson, and Frans A Oliehoek. Point-based planning for multi-\nobjective pomdps. In IJCAI 2015: Proceedings of the Twenty-Fourth International Joint Conference\non Arti\ufb01cial Intelligence, pages 1666\u20131672, 2015.\n\nStuart Russell. Learning Agents for Uncertain Environments. In Proceedings of the eleventh annual\n\nconference on Computational learning theory, pages 101\u2013103. ACM, 1998.\n\nHarold Soh and Yiannis Demiris. Evolving Policies for Multi-Reward Partially Observable Markov\nDecision Processes (MR-POMDPs). In Proceedings of the 13th annual conference on Genetic and\nevolutionary computation, pages 713\u2013720. ACM, 2011.\n\nEdward J. Sondik. The Optimal Control of Partially Observable Markov Processes over a Finite\n\nHorizon. Oper. Res., 21(5):1071\u20131088, October 1973.\n\nGwo-Hshiung Tzeng and Jih-Jeng Huang. Multiple Attribute Decision Making: Methods and\n\nApplications. CRC press, 2011.\n\nWeijia Wang. Multi-Objective Sequential Decision Making. PhD thesis, Universit\u00e9 Paris Sud-Paris\n\nXI, 2014.\n\nChongjie Zhang and Julie A Shah. Fairness in Multi-Agent Sequential Decision-Making. In Advances\n\nin Neural Information Processing Systems, pages 2636\u20132644, 2014.\n\n9\n\n\f", "award": [], "sourceid": 2287, "authors": [{"given_name": "Nishant", "family_name": "Desai", "institution": "UC Berkeley"}, {"given_name": "Andrew", "family_name": "Critch", "institution": "UC Berkeley"}, {"given_name": "Stuart", "family_name": "Russell", "institution": "UC Berkeley"}]}