{"title": "A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment", "book": "Advances in Neural Information Processing Systems", "page_first": 7869, "page_last": 7880, "abstract": "Empowerment is an information-theoretic method that can be used to intrinsically motivate learning agents. It attempts to maximize an agent's control over the environment by encouraging visiting states with a large number of reachable next states. Empowered learning has been shown to lead to complex behaviors, without requiring an explicit reward signal. In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. We hypothesize that empowerment can guide reinforcement learning (RL) agents to find good early behavioral solutions by encouraging highly empowered states.\nWe propose a unified Bellman optimality principle for empowered reward maximization. Our empowered reward maximization approach generalizes both Bellman\u2019s optimality principle as well as recent information-theoretical extensions to it. We prove uniqueness of the empowered values and show convergence to the optimal solution. We then apply this idea to develop off-policy actor-critic RL algorithms which we validate in high-dimensional continuous robotics domains (MuJoCo). Our methods demonstrate improved initial and competitive final performance compared to model-free state-of-the-art techniques.", "full_text": "A Uni\ufb01ed Bellman Optimality Principle Combining\n\nReward Maximization and Empowerment\n\nFelix Leibfried, Sergio Pascual-D\u00edaz, Jordi Grau-Moya\n\nPROWLER.io\nCambridge, UK\n\n{felix,sergio.diaz,jordi}@prowler.io\n\nAbstract\n\nEmpowerment is an information-theoretic method that can be used to intrinsically\nmotivate learning agents. It attempts to maximize an agent\u2019s control over the\nenvironment by encouraging visiting states with a large number of reachable\nnext states. 
Empowered learning has been shown to lead to complex behaviors,\nwithout requiring an explicit reward signal. In this paper, we investigate the use\nof empowerment in the presence of an extrinsic reward signal. We hypothesize\nthat empowerment can guide reinforcement learning (RL) agents to \ufb01nd good\nearly behavioral solutions by encouraging highly empowered states. We propose a\nuni\ufb01ed Bellman optimality principle for empowered reward maximization. Our\nempowered reward maximization approach generalizes both Bellman\u2019s optimality\nprinciple as well as recent information-theoretical extensions to it. We prove\nuniqueness of the empowered values and show convergence to the optimal solution.\nWe then apply this idea to develop off-policy actor-critic RL algorithms which\nwe validate in high-dimensional continuous robotics domains (MuJoCo). Our\nmethods demonstrate improved initial and competitive \ufb01nal performance compared\nto model-free state-of-the-art techniques.\n\n1\n\nIntroduction\n\nIn reinforcement learning [62] (RL), agents identify policies to collect as much reward as possible in\na given environment. Recently, leveraging parametric function approximators has led to tremendous\nsuccess in applying RL to high-dimensional domains such as Atari games [40] or robotics [56]. In\nsuch domains, inspired by the policy gradient theorem [63, 13], actor-critic approaches [36, 41] attain\nstate-of-the-art results by learning both a parametric policy and a value function.\nEmpowerment is an information-theoretic framework where agents maximize the mutual information\nbetween an action sequence and the state that is obtained after executing this action sequence from\nsome given initial state [26, 27, 53]. It turns out that the mutual information is highest for such initial\nstates where the number of reachable next states is largest. Policies that aim for high empowerment\ncan lead to complex behavior, e.g. 
balancing a pole in the absence of any explicit reward signal [23].
Despite progress on learning empowerment values with function approximators [42, 12, 49], there have been few attempts to combine empowerment with reward maximization, let alone to utilize it for RL in the high-dimensional domains to which it has only recently become applicable. We therefore propose a unified principle for reward maximization and empowerment, and demonstrate that empowered signals can boost RL in large-scale domains such as robotics. In short, our contributions are:

• a generalized Bellman optimality principle for joint reward maximization and empowerment,
• a proof for unique values and convergence to the optimal solution for our novel principle,
• empowered actor-critic methods boosting RL in MuJoCo compared to model-free baselines.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Background

2.1 Reinforcement Learning

In the discrete RL setting, an agent, being in state s ∈ S, executes an action a ∈ A according to a behavioral policy π_behave(a|s) that is a conditional probability distribution π_behave : S × A → [0, 1]. The environment, in response, transitions to a successor state s′ ∈ S according to a (probabilistic) state-transition function P(s′|s, a), where P : S × A × S → [0, 1]. Furthermore, the environment generates a reward signal r = R(s, a) according to a reward function R : S × A → R. The agent's aim is to maximize its expected future cumulative reward with respect to the behavioral policy, max_{π_behave} E_{π_behave,P}[ Σ_{t=0}^{∞} γ^t r_t ], with t being a time index and γ ∈ (0, 1) a discount factor. 
Optimal expected future cumulative reward values for a given state s then obey the following recursion:

V⋆(s) = max_a [ R(s, a) + γ E_{P(s′|s,a)}[ V⋆(s′) ] ] =: max_a Q⋆(s, a),   (1)

referred to as Bellman's optimality principle [4], where V⋆ and Q⋆ are the optimal value functions.

2.2 Empowerment

Empowerment is an information-theoretic method where an agent executes a sequence of k actions ā ∈ A^k when in state s ∈ S according to a policy π_empower(ā|s), which is a conditional probability distribution π_empower : S × A^k → [0, 1]. This is slightly more general than the RL setting, where only a single action is taken upon observing a given state. The agent's aim is to identify an optimal policy π_empower that maximizes the mutual information I(Ā, S′ | s) between the action sequence ā and the state s′ to which the environment transitions after executing ā in s, formulated as:

E⋆(s) = max_{π_empower} I(Ā, S′ | s) = max_{π_empower} E_{π_empower(ā|s) P^(k)(s′|s,ā)}[ log ( p(ā|s′, s) / π_empower(ā|s) ) ].   (2)

Here, E⋆(s) refers to the optimal empowerment value and P^(k)(s′|s, ā) to the probability of transitioning to s′ after executing the sequence ā in state s, where P^(k) : S × A^k × S → [0, 1]. Importantly, p(ā|s′, s) = P^(k)(s′|s, ā) π_empower(ā|s) / Σ_ā P^(k)(s′|s, ā) π_empower(ā|s) is the inverse dynamics model of π_empower. 
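For a discrete one-step setting (k = 1), the inverse dynamics model is just a Bayes inversion of the forward model. A minimal sketch in Python; the tiny two-action, two-state tables are made-up toy numbers for illustration:

```python
def inverse_dynamics(policy_s, trans_s, s_next):
    """p(a | s', s) ∝ P(s'|s,a) * pi(a|s) for a fixed current state s.

    policy_s: list with policy_s[a] = pi(a|s)
    trans_s:  nested list with trans_s[a][s2] = P(s2|s,a)
    """
    unnorm = [trans_s[a][s_next] * policy_s[a] for a in range(len(policy_s))]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical toy numbers: 2 actions, 2 successor states.
pi_s = [0.5, 0.5]
P_s = [[0.9, 0.1],   # action 0 mostly leads to state 0
       [0.2, 0.8]]   # action 1 mostly leads to state 1
post = inverse_dynamics(pi_s, P_s, 1)  # action 1 is the likely cause of s'=1
```

Note how the posterior concentrates on the action that best explains the observed successor state, which is exactly the quantity appearing inside the mutual-information objective above.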
The implicit dependency of p on the optimization argument π_empower renders the problem non-trivial.

From an information-theoretic perspective, optimizing for empowerment is equivalent to maximizing the capacity [58] of an information channel P^(k)(s′|s, ā) with input ā and output s′ w.r.t. the input distribution π_empower(ā|s), as outlined in the following [11, 10]. Define the functional I_f(π_empower, P^(k), q) := E_{π_empower(ā|s) P^(k)(s′|s,ā)}[ log ( q(ā|s′, s) / π_empower(ā|s) ) ], where q is a conditional probability q : S × S × A^k → [0, 1]. The mutual information is then recovered as a special case of I_f via I(Ā, S′ | s) = max_q I_f(π_empower, P^(k), q) for a given π_empower. The maximizing argument is the true Bayesian posterior

q⋆(ā|s′, s) = P^(k)(s′|s, ā) π_empower(ā|s) / Σ_ā P^(k)(s′|s, ā) π_empower(ā|s) = p(ā|s′, s),   (3)

see [10], Lemma 10.8.1, for details. Similarly, maximizing I_f(π_empower, P^(k), q) with respect to π_empower for a given q leads to:

π⋆_empower(ā|s) = exp( E_{P^(k)(s′|s,ā)}[ log q(ā|s′, s) ] ) / Σ_ā exp( E_{P^(k)(s′|s,ā)}[ log q(ā|s′, s) ] ),   (4)

as explained e.g. in [10], page 335, and similarly in [46]. The above yields the subsequent proposition.

Proposition 1 (Maximum Channel Capacity). Iterating through Equations (3) and (4) by computing q given π_empower and vice versa in an alternating fashion converges to an optimal pair (q⋆, π⋆_empower) that maximizes the mutual information max_{π_empower} I(Ā, S′ | s) = I_f(π⋆_empower, P^(k), q⋆). The convergence rate is O(1/N), where N is the number of iterations, for any initial π^ini_empower with support in A^k ∀s—see [10], Chapter 10.8, and [11, 16]. This is known as the Blahut-Arimoto algorithm [2, 7].

Remark. Empowerment is related to curiosity concepts of predictive information that focus on the mutual information between the current and the subsequent state [6, 48, 69, 61, 43, 54].

3 Motivation: Combining Reward Maximization with Empowerment

The Blahut-Arimoto algorithm presented in the previous section solves empowerment for low-dimensional discrete settings but does not readily scale to high-dimensional or continuous state-action spaces. While there has been progress on learning empowerment values with parametric function approximators [42], how to combine empowerment with reward maximization or RL remains open. In principle, there are two possibilities for utilizing empowerment. The first is to directly use the policy π⋆_empower obtained in the course of learning empowerment values E⋆(s). The second is to train a behavioral policy to take an action in each state such that the expected empowerment value of the next state is highest (requiring E⋆-values as a prerequisite). Note that the two possibilities are conceptually different. The latter seeks states with a large number of reachable next states [23]. The first, on the other hand, aims for high mutual information between actions and the subsequent state, which is not necessarily the same as seeking highly empowered states [42].

We hypothesize empowered signals to be beneficial for RL, especially in high-dimensional environments and at the beginning of the training process when the initial policy is poor. 
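The alternation in Proposition 1 takes only a few lines of code. The sketch below runs one-step (k = 1) Blahut-Arimoto for a single fixed state; the noiseless two-action channel is a made-up example whose capacity is known to be log 2 nats:

```python
import math

def blahut_arimoto(P, iters=200):
    """Maximize I(A; S') for channel P[a][s2] over the input distribution pi(a).

    Alternates the posterior update (Eq. 3) with the input update (Eq. 4).
    Returns (pi, mutual_information_in_nats).
    """
    nA, nS = len(P), len(P[0])
    pi = [1.0 / nA] * nA                      # uniform initialization
    for _ in range(iters):
        # Eq. (3): q(a|s') ∝ P(s'|a) * pi(a)
        q = [[0.0] * nA for _ in range(nS)]
        for sp in range(nS):
            z = sum(P[a][sp] * pi[a] for a in range(nA))
            for a in range(nA):
                q[sp][a] = P[a][sp] * pi[a] / z if z > 0 else 0.0
        # Eq. (4): pi(a) ∝ exp( E_{s'~P(.|a)} log q(a|s') )
        logits = [sum(P[a][sp] * math.log(max(q[sp][a], 1e-300))
                      for sp in range(nS)) for a in range(nA)]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        pi = [wi / sum(w) for wi in w]
    mi = sum(pi[a] * P[a][sp] * math.log(q[sp][a] / pi[a])
             for a in range(nA) for sp in range(nS) if P[a][sp] > 0)
    return pi, mi

# Hypothetical noiseless channel: each action reaches a distinct next state,
# so the capacity is log(2) and the optimal input distribution is uniform.
pi, mi = blahut_arimoto([[1.0, 0.0], [0.0, 1.0]])
```

The same alternation generalizes to k-step action sequences by enumerating A^k as the channel input alphabet.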
In this work, we therefore combine reward maximization with empowerment, inspired by the two behavioral possibilities outlined in the previous paragraph. Hence, we focus on the cumulative RL setting rather than the non-cumulative setting that is typical for empowerment. We furthermore use one-step empowerment as a reference, i.e. k = 1, because cumulative one-step empowerment learning leads to high values in states where the number of possibly reachable next states is high, and hence preserves the original empowerment intuition without requiring a multi-step policy—see Section 4.3. The first idea is to train a policy that trades off reward maximization and learning cumulative empowerment:

max_{π_behave} E_{π_behave,P}[ Σ_{t=0}^{∞} γ^t ( α R(s_t, a_t) + β log ( p(a_t|s_{t+1}, s_t) / π_behave(a_t|s_t) ) ) ],   (5)

where α ≥ 0 and β ≥ 0 are scaling factors, and p indicates the inverse dynamics model of π_behave in line with Equation (3). Note that p depends on the optimization argument π_behave, similar to ordinary empowerment, leading to a non-trivial Markov decision problem (MDP).

The second idea is to learn cumulative empowerment values a priori by solving Equation (5) with α = 0 and β = 1. The outcome of this is a policy π⋆_empower (and its inverse dynamics model p) that can be used to construct an intrinsic reward signal which is then added to the external reward:

max_{π_behave} E_{π_behave,P}[ Σ_{t=0}^{∞} γ^t ( α R(s_t, a_t) + β E_{π⋆_empower(a|s_t) P(s′|s_t,a)}[ log ( p(a|s′, s_t) / π⋆_empower(a|s_t) ) ] ) ].   (6)

Importantly, Equation (6) poses an ordinary MDP since the reward signal is merely extended by another stationary state-dependent signal.

Both proposed ideas require solving the novel MDP specified in Equation (5). 
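The per-step objective inside Equation (5) can be made concrete for a tabular one-step case. The helper below combines an extrinsic reward with the empowerment term; all inputs are made-up toy numbers:

```python
import math

def augmented_reward(r, pi_a, p_a_given_next, alpha=1.0, beta=0.5):
    """alpha * R(s,a) + beta * log( p(a|s',s) / pi(a|s) ), as in Eq. (5)."""
    return alpha * r + beta * math.log(p_a_given_next / pi_a)

# If the action is far more identifiable from the outcome than its prior
# probability suggests, the empowerment term adds a positive bonus.
bonus = augmented_reward(r=1.0, pi_a=0.25, p_a_given_next=0.9)
```

Setting alpha = 0 and beta = 1 reduces this to the pure cumulative-empowerment reward used by the second idea.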
In Section 4, we therefore prove the existence of unique values and convergence of the corresponding value iteration scheme (including a grid world example). We also show how our formulation generalizes existing formulations from the literature. In Section 5, we carry our ideas over to high-dimensional continuous state-action spaces by devising off-policy actor-critic-style algorithms inspired by the proposed MDP formulation. We evaluate our novel actor-critic-style algorithms in MuJoCo, demonstrating better initial and competitive final performance compared to model-free state-of-the-art baselines.

4 Joint Reward Maximization and Empowerment Learning in MDPs

We state our main theoretical result in advance, proven in the remainder of this section (an intuition follows): the solution to the MDP from Equation (5) implies unique optimal values V⋆ obeying the Bellman recursion

V⋆(s) = max_{π_behave} E_{π_behave,P}[ Σ_{t=0}^{∞} γ^t ( α R(s_t, a_t) + β log ( p(a_t|s_{t+1}, s_t) / π_behave(a_t|s_t) ) ) | s_0 = s ]
      = max_{π_behave,q} E_{π_behave(a|s)}[ α R(s, a) + E_{P(s′|s,a)}[ β log ( q(a|s′, s) / π_behave(a|s) ) + γ V⋆(s′) ] ]
      = β log Σ_a exp( (α/β) R(s, a) + E_{P(s′|s,a)}[ log q⋆(a|s′, s) + (γ/β) V⋆(s′) ] ),   (7)

where

q⋆(a|s′, s) = P(s′|s, a) π⋆_behave(a|s) / Σ_a P(s′|s, a) π⋆_behave(a|s) = p(a|s′, s)   (8)

is the inverse dynamics model of the optimal behavioral policy π⋆_behave that assumes the form:

π⋆_behave(a|s) = exp( (α/β) R(s, a) + E_{P(s′|s,a)}[ log q⋆(a|s′, s) + (γ/β) V⋆(s′) ] ) / Σ_a exp( (α/β) R(s, a) + E_{P(s′|s,a)}[ log q⋆(a|s′, s) + (γ/β) V⋆(s′) ] ),   (9)

where the denominator is just exp((1/β) V⋆(s)). While the remainder of this section explains how Equations (7) to (9) are derived in detail, it can be insightful to understand at a high level what makes our formulation non-trivial. The difficulty is that the inverse dynamics model p = q⋆ depends on the optimal policy π⋆_behave and vice versa, leading to a non-standard optimal value identification problem. Proving the existence of V⋆-values and how to compute them therefore poses our main theoretical contribution, and implies the existence of at least one (q⋆, π⋆_behave)-pair that satisfies the recursive relationship of Equations (8) and (9). This proof is given in Section 4.1 and leads naturally to a value iteration scheme to compute optimal values in practice. The convergence of this scheme is proven in Section 4.2, and we also demonstrate value learning in a grid world example—see Section 4.3. In Section 4.4, we elucidate how our formulation generalizes and relates to existing MDP formulations.

4.1 Existence of Unique Optimal Values

Following the second line of Equation (7), let us define the Bellman operator B⋆ : R^|S| → R^|S| as

B⋆V(s) := max_{π_behave,q} E_{π_behave(a|s)}[ α R(s, a) + E_{P(s′|s,a)}[ β log ( q(a|s′, s) / π_behave(a|s) ) + γ V(s′) ] ].   (10)

Theorem 1 (Existence of Unique Optimal Values). Assuming a bounded reward function R, the optimal value vector V⋆ as given in Equation (7) exists and is a unique fixed point V⋆ = B⋆V⋆ of the Bellman operator B⋆ from Equation (10).

Proof. The proof of Theorem 1 comprises three steps. 
First, we prove for a given (q, π_behave)-pair the existence of unique values V^(q,π_behave) which obey the following recursion:

V^(q,π_behave)(s) = E_{π_behave(a|s)}[ α R(s, a) + E_{P(s′|s,a)}[ β log ( q(a|s′, s) / π_behave(a|s) ) + γ V^(q,π_behave)(s′) ] ].   (11)

This result is obtained through Proposition 2, following [5, 51, 18], where we show that the value vector V^(q,π_behave) is a unique fixed point of the operator B_{q,π_behave} : R^|S| → R^|S| given by

B_{q,π_behave}V(s) := E_{π_behave(a|s)}[ α R(s, a) + E_{P(s′|s,a)}[ β log ( q(a|s′, s) / π_behave(a|s) ) + γ V(s′) ] ].   (12)

Second, we prove in Proposition 3 that solving the right-hand side of Equation (10) for the pair (q, π_behave) can be achieved with a Blahut-Arimoto-style algorithm in line with [16]. Third, we complete the proof in Proposition 4, based on Propositions 2 and 3, by showing that V⋆ = max_{π_behave,q} V^(q,π_behave), where the vector-valued max-operator is well-defined because both π_behave and q are conditioned on s. The proof completion follows again [5, 51, 18]. □

Proposition 2 (Existence of Unique Values for a Given (q, π_behave)-Pair). Assuming a bounded reward function R, the value vector V^(q,π_behave) as given in Equation (11) exists and is a unique fixed point V^(q,π_behave) = B_{q,π_behave}V^(q,π_behave) of the Bellman operator B_{q,π_behave} from Equation (12).

As opposed to the Bellman operator B⋆, the operator B_{q,π_behave} does not include a max-operation that incurs a non-trivial recursive relationship between optimal arguments. The proof for existence of unique values hence follows standard methodology [5, 51, 18] and is given in Appendix A.1.

Proposition 3 (Blahut-Arimoto for One Value Iteration Step). Assuming that R is bounded, the maximization problem max_{π_behave,q} from Equation (10) in the Bellman operator B⋆ can be solved for (q, π_behave) by iterating through the following two equations in an alternating fashion:

q^(m)(a|s′, s) = P(s′|s, a) π^(m)_behave(a|s) / Σ_a P(s′|s, a) π^(m)_behave(a|s),   (13)

π^(m+1)_behave(a|s) = exp( (α/β) R(s, a) + E_{P(s′|s,a)}[ log q^(m)(a|s′, s) + (γ/β) V(s′) ] ) / Σ_a exp( (α/β) R(s, a) + E_{P(s′|s,a)}[ log q^(m)(a|s′, s) + (γ/β) V(s′) ] ),   (14)

where m is the iteration index. The convergence rate is O(1/M) for arbitrary initial π^(0)_behave with support in A ∀s, where M is the total number of iterations. The complexity for a single s is O(M|S||A|).

Proof Outline. The problem in Proposition 3 is mathematically similar to the maximum channel capacity problem [58] from Proposition 1, and proving convergence follows similar steps that we outline here—details can be found in Appendix A.2. First, we prove that optimizing the right-hand side of Equation (10) w.r.t. q for a given π_behave results in Equation (13), according to [10], Lemma 10.8.1. Second, we prove that optimizing w.r.t. π_behave for a given q results in Equation (14), following standard techniques from variational calculus and Lagrange multipliers. Third, we prove convergence to a global maximum when iterating alternately through Equations (13) and (14), following [16].

Proposition 4 (Completing the Proof of Theorem 1). 
The optimal value vector is given by V⋆ = max_{π_behave,q} V^(q,π_behave) and is a unique fixed point V⋆ = B⋆V⋆ of the Bellman operator B⋆.

Completing the proof of Theorem 1 requires two ingredients: the existence of unique V^(q,π_behave)-values for any (q, π_behave)-pair as proven in Proposition 2, and the fact that the optimal Bellman operator can be expressed as B⋆ = max_{π_behave,q} B_{q,π_behave}, where max_{π_behave,q} is the max-operator from Proposition 3. The proof then follows standard methodology [5, 51, 18], see Appendix A.3.

4.2 Value Iteration and Convergence to Optimal Values

In the previous section, we have proven the existence of unique optimal values V⋆ that are a fixed point of the Bellman operator B⋆. This section devises a value iteration scheme based on the operator B⋆ and proves its convergence. We commence with a corollary to express B⋆ more concisely.

Corollary 1 (Optimal Bellman Operator). The operator B⋆ from Equation (10) can be written as

B⋆V(s) = β log Σ_a exp( (α/β) R(s, a) + E_{P(s′|s,a)}[ log q_converged(a|s′, s) + (γ/β) V(s′) ] ),   (15)

where q_converged(a|s′, s) is the result of the converged Blahut-Arimoto scheme from Proposition 3.

This result is obtained by plugging the converged solution π^converged_behave from Equation (14) into Equation (10), and it leads naturally to a two-level value iteration algorithm that proceeds as follows: the outer loop updates the values V by applying Equation (15) repeatedly; the inner loop applies the Blahut-Arimoto algorithm from Proposition 3 to identify the q_converged required for the outer value update.

Theorem 2 (Convergence to Optimal Values). Assume bounded R, and let ε ∈ R be a positive number such that ε < η/(1−γ), where η = α max_{s,a} |R(s, a)| + β log |A|. If the value iteration scheme with initial values V(s) = 0 ∀s is run for i ≥ ⌈ log_γ ( ε(1−γ)/η ) ⌉ iterations, then ‖V⋆ − B⋆^(i)V‖_∞ ≤ ε, where the notation B⋆^(i)V means applying B⋆ to V i times consecutively.

Proof. Via a sequence of inequalities, one can show that the following holds true:

‖V⋆ − B⋆^(i)V‖_∞ ≤ γ ‖V⋆ − B⋆^(i−1)V‖_∞ ≤ γ^i ‖V⋆ − V‖_∞ ≤ γ^i (1/(1−γ)) η,

see Appendix A.4 for a more detailed derivation. This implies that if ε ≥ γ^i (1/(1−γ)) η, then i ≥ ⌈ log_γ ( ε(1−γ)/η ) ⌉, presupposing ε < η/(1−γ). □

Conclusion. Together, Theorems 1 and 2 prove that our proposed value iteration scheme converges to optimal values V⋆ in combination with a corresponding optimal pair (q⋆, π⋆_behave), as described at the beginning of this section in the third line of Equation (7) and in Equations (8) and (9) respectively. The overall complexity is O(iM|S|²|A|), where i and M refer to outer and inner iterations.

Remark. Our value iteration is required for both objectives from Section 3 that combine reward maximization with empowerment. Equation (5) motivated our scheme in the first place, whereas Equation (6) requires cumulative empowerment values without reward maximization (α = 0, β = 1).

4.3 Practical Verification in a Grid World Example

In order to practically verify our value iteration scheme from the previous section, we conduct experiments on a grid world example. 
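The two-level scheme can be sketched in a few lines of tabular Python. The three-state chain MDP below is a made-up toy, and the iteration counts are illustrative, not tuned:

```python
import math

def value_iteration(P, R, alpha=1.0, beta=1.0, gamma=0.9, outer=200, inner=30):
    """Two-level value iteration for the empowered Bellman operator.

    P[s][a][s2] = P(s2|s,a), R[s][a] = reward. Returns (V, pi).
    """
    nS, nA = len(P), len(P[0])
    V = [0.0] * nS
    pi = [[1.0 / nA] * nA for _ in range(nS)]
    for _ in range(outer):
        newV = [0.0] * nS
        for s in range(nS):
            # Inner loop: Blahut-Arimoto alternation, Eqs. (13) and (14).
            for _ in range(inner):
                # Eq. (13): q(a|s2,s) ∝ P(s2|s,a) * pi(a|s)
                q = [[0.0] * nA for _ in range(nS)]
                for s2 in range(nS):
                    z = sum(P[s][a][s2] * pi[s][a] for a in range(nA))
                    if z > 0:
                        for a in range(nA):
                            q[s2][a] = P[s][a][s2] * pi[s][a] / z
                # Eq. (14): softmax of (alpha/beta) R + E[log q + (gamma/beta) V]
                logits = [alpha / beta * R[s][a]
                          + sum(P[s][a][s2] * (math.log(max(q[s2][a], 1e-300))
                                               + gamma / beta * V[s2])
                                for s2 in range(nS))
                          for a in range(nA)]
                m = max(logits)
                w = [math.exp(l - m) for l in logits]
                pi[s] = [wi / sum(w) for wi in w]
            # Outer update, Eq. (15): V(s) = beta * logsumexp(logits)
            newV[s] = beta * (m + math.log(sum(w)))
        V = newV
    return V, pi

# Toy 3-state chain: action 1 moves right; state 2 is absorbing, and both of
# its actions are indistinguishable, so its empowered value should be zero.
P = [[[1, 0, 0], [0, 1, 0]],
     [[1, 0, 0], [0, 0, 1]],
     [[0, 0, 1], [0, 0, 1]]]
R = [[0, 0], [0, 1], [0, 0]]
V, pi = value_iteration(P, R)
```

In the absorbing state, the inverse dynamics posterior equals the policy itself, so the log-ratio term vanishes and the value stays at zero; states with distinguishable actions accumulate a positive empowerment contribution.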
The outcome is shown in Figure 1, demonstrating how different configurations of α and β, which steer cumulative reward maximization versus empowerment learning, affect the optimal values V⋆. Importantly, the experiments show that our proposal to learn cumulative one-step empowerment values recovers the original intuition of empowerment, in the sense that high values are assigned to states from which many other states can be reached and low values to states where the number of reachable next states is low, but without the necessity to maintain a multi-step policy.

Figure 1: Value Iteration for a Grid World Example. The agent aims to arrive at the goal 'G' in the lower left—detailed information regarding the setup can be found in Appendix C.1. The plots show optimal values for different α and β: α increases from left to right while β decreases. The leftmost values show raw cumulative empowerment learning (α = 0.0, β = 1.0). High values are assigned to states where many other states can be reached, i.e. the upper right; and low values to states where the number of reachable next states is low, i.e. close to corners and dead ends. 
The rightmost values recover ordinary cumulative reward maximization (α = 1.0, β = 0.0), assigning high values to states close to the goal and low values to states far away from the goal.

4.4 Generalization of and Relation to Existing MDP Formulations

Our Bellman operator B⋆ from Equation (10) relates to prior work as follows (see also Appendix A.5).

• Ordinary value iteration [52] is recovered as a special case for α = 1 and β = 0.
• Cumulative one-step empowerment is recovered as a special case for α = 0 and β = 1, with non-cumulative one-step empowerment [29] as a further special case of the latter (γ → 0).
• When setting q(a|s′, s) = q(a|s), using a distribution that is not conditioned on s′, and omitting the maximization w.r.t. q, one recovers as a special case the soft Bellman operator presented e.g. in [51]. Note that this soft Bellman operator has also occurred in numerous other works on MDP formulations and RL [3, 14, 45, 55, 33].
• As a special case of the previous, when q(a|s′, s) = U(A) is the uniform distribution in action space, one recovers cumulative entropy regularization [70, 44, 34] that inspired algorithms such as soft Q-learning [20] and soft actor-critic [21, 22].
• When dropping the conditioning on s′ and s by setting q(a|s′, s) = q(a), but without omitting the maximization w.r.t. q, one recovers a formulation similar to [65] based on mutual-information regularization [59, 60, 17, 31] that spurred RL algorithms such as [30, 19, 32].
• When replacing q(a|s′, s) with a distribution conditioned on the state-action pair of the previous time step, one recovers a formulation similar to [64] based on the information-theoretic principle of directed information [38, 28, 39].

5 Scaling to High-Dimensional Environments

In the previous section, we presented a novel Bellman operator in combination with a value iteration scheme to combine reward maximization and empowerment. In this section, by leveraging parametric function approximators, we validate our ideas in high-dimensional state-action spaces and when there is no prior knowledge of the state-transition function. In Section 5.1, we devise novel actor-critic algorithms for RL based on our MDP formulation, since they are naturally capable of handling both continuous state and action spaces. In Section 5.2, we practically confirm that empowerment can boost RL in the high-dimensional robotics simulator domain of MuJoCo using deep neural networks.

5.1 Empowered Off-Policy Actor-Critic Methods with Parametric Function Approximators

Contemporary off-policy actor-critic approaches for RL [36, 1, 15] follow the policy gradient theorem [63, 13] and learn two parametric function approximators: one for the behavioral policy π_φ(a|s) with parameters φ, and one for the state-action value function Q_θ(s, a) of the parametric policy π_φ with parameters θ. The policy learning objective usually assumes the form max_φ E_{s∼D}[ E_{π_φ(a|s)}[ Q_θ(s, a) ] ], where D refers to a replay buffer [37] that stores collected state transitions from the environment. 
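A replay buffer of the kind referenced above is only a few lines of Python; capacity and batch size are arbitrary illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s') environment transitions."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling with replacement, as in standard off-policy RL.
        return [random.choice(self.storage) for _ in range(batch_size)]

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.add(s=t, a=t % 2, r=1.0, s_next=t + 1)
batch = buf.sample(8)
```

Because samples come from many past policies, the buffer supplies off-policy data, which is exactly why the expectation over fresh actions in the value objective below cannot reuse buffered actions.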
Following [21], Q-values are learned most efficiently by introducing another function approximator V_ψ for state values of π_φ with parameters ψ, using the objective:

min_θ E_{s,a,r,s′∼D}[ ( Q_θ(s, a) − (α r + γ V_ψ(s′)) )² ],   (16)

where (s, a, r, s′) refers to an environment interaction sampled from the replay buffer (r stands for the observed reward signal). We multiply r by the scaling factor α from our formulation because Equation (16) can be directly used for the parametric methods we propose. Learning policy parameters φ and value parameters ψ, however, requires novel objectives with two additional approximators: one for the inverse dynamics model p_χ(a|s′, s) of π_φ, and one for the transition function P_ξ(s′|s, a) (with parameters χ and ξ respectively). While the necessity for p_χ is clear, e.g. from inspecting Equation (5), the necessity for P_ξ will fall into place shortly as we move forward.

In order to preserve a clear view, let us define the quantity f(s, a) := E_{P_ξ(s′|s,a)}[ log p_χ(a|s′, s) ] − log π_φ(a|s), which is short-hand notation for the empowerment-induced addition to the reward signal—compare to Equation (5). We then commence with the objective for value function learning:

min_ψ E_{s∼D}[ ( V_ψ(s) − E_{π_φ(a|s)}[ Q_θ(s, a) + β f(s, a) ] )² ],   (17)

which is similar to the standard value objective but with the added term β f(s, a) as a result of joint cumulative empowerment learning. At this point, the necessity for a transition model P_ξ becomes apparent. In the above equation, new actions a need to be sampled from the policy π_φ for a given s. However, the inverse dynamics model (inside f) depends on the subsequent state s′ as well, therefore requiring a prediction for the next state. Note also that (s, a, r, s′)-tuples from the replay buffer as in Equation (16) can't be used here, because the expectation over a is w.r.t. the current policy, whereas tuples from the replay buffer come from a mixture of policies at an earlier stage of training. Extending the ordinary actor-critic policy objective with the empowerment-induced term f yields:

max_φ E_{s∼D}[ E_{π_φ(a|s)}[ Q_θ(s, a) + β f(s, a) ] ].   (18)

The remaining parameters to be optimized are χ and ξ from the inverse dynamics model p_χ and the transition model P_ξ. Both problems are supervised learning problems that can be addressed by log-likelihood maximization using samples from the replay buffer, leading to max_χ E_{s∼D}[ E_{π_φ(a|s) P_ξ(s′|s,a)}[ log p_χ(a|s′, s) ] ] and max_ξ E_{s,a,s′∼D}[ log P_ξ(s′|s, a) ].

Coming back to our motivation from Section 3, we propose two novel empowerment-inspired actor-critic approaches based on the optimization objectives specified in this section. The first combines cumulative reward maximization and empowerment learning following Equation (5), which we refer to as empowered actor-critic. The second learns cumulative empowerment values to construct intrinsic rewards following Equation (6), which we refer to as actor-critic with intrinsic empowerment.

Empowered Actor-Critic (EAC). 
In line with standard off-policy actor-critic methods [36, 15, 21], EAC interacts with the environment iteratively, storing transition tuples (s, a, r, s′) in a replay buffer. After each interaction, a training batch {(s, a, r, s′)^(b)}_{b=1}^{B} ∼ D of size B is sampled from the buffer to perform a single gradient update on the objectives from Equations (16) to (18), as well as on the log-likelihood objectives for the inverse dynamics and transition models (see Appendix B for pseudocode).

Actor-Critic with Intrinsic Empowerment (ACIE). By setting α = 0 and β = 1, EAC can train an agent that focuses merely on cumulative empowerment learning. Since EAC is off-policy, it can learn with samples obtained from executing any policy in the real environment, e.g. the actor of any other reward-maximizing actor-critic algorithm. We can then extend the external rewards r_t at time t of this actor-critic algorithm with intrinsic rewards E_{π_φ(a|s_t)P_ξ(s′|s_t,a)}[log(p_χ(a|s′, s_t) / π_φ(a|s_t))] according to Equation (6), where (φ, ξ, χ) are the result of concurrent raw empowerment learning with EAC. This idea is similar to the preliminary work of [29], which uses non-cumulative empowerment as intrinsic motivation for deep value-based RL with discrete actions in the Atari game Montezuma's Revenge.

5.2 Experiments with Deep Function Approximators in MuJoCo

We validate EAC and ACIE in the robotics simulator MuJoCo [66, 8] with deep neural nets under the same setup for each experiment, following [67, 25, 50, 24, 56, 36, 68, 57, 1, 9, 15, 21] (see Appendix C.2 for details). While EAC is a standalone algorithm, ACIE can be combined with any RL algorithm (we use the model-free state of the art SAC [21]).
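To make the quantities that EAC optimizes concrete, the following tabular toy sketch spells out the per-sample losses of Equations (16) to (18). The discrete setup and all names are illustrative assumptions; the actual method represents π_φ, Q_θ, V_ψ, p_χ and P_ξ with neural networks and applies stochastic gradient steps on replay-buffer mini-batches:

```python
import numpy as np

# Hypothetical tabular setup (illustrative only): S states, A actions.
S, A = 4, 2
gamma, alpha, beta = 0.99, 1.0, 0.1
rng = np.random.default_rng(0)

Q = rng.normal(size=(S, A))                 # critic Q_theta(s, a)
V = rng.normal(size=S)                      # state values V_psi(s)
pi = np.full((S, A), 1.0 / A)               # policy pi_phi(a|s)
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition model P_xi(s'|s, a)
p_inv = np.full((S, S, A), 1.0 / A)         # inverse model p_chi(a|s', s), indexed [s, s', a]

def f_term(s, a):
    # f(s, a) = E_{s' ~ P_xi(.|s,a)}[log p_chi(a|s', s)] - log pi_phi(a|s)
    return P[s, a] @ np.log(p_inv[s, :, a]) - np.log(pi[s, a])

def critic_loss(s, a, r, s_next):
    # Equation (16): squared error against the scaled-reward TD target.
    return (Q[s, a] - (alpha * r + gamma * V[s_next])) ** 2

def value_loss(s):
    # Equation (17): V_psi regresses onto Q plus the empowerment bonus beta*f.
    target = sum(pi[s, a] * (Q[s, a] + beta * f_term(s, a)) for a in range(A))
    return (V[s] - target) ** 2

def policy_objective(s):
    # Equation (18): the actor maximizes Q plus the empowerment bonus.
    return sum(pi[s, a] * (Q[s, a] + beta * f_term(s, a)) for a in range(A))
```

With a uniform policy and a uniform inverse model, f(s, a) vanishes and the objectives reduce to the standard actor-critic losses; the empowerment bonus only shapes learning once the inverse model becomes more informative about a than the policy itself.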
We compare against DDPG [36] and PPO [57] from RLlib [35], as well as SAC, on the MuJoCo v2-environments (ten seeds per run [47]). The results in Figure 2 confirm that both EAC and ACIE can attain better initial performance than the model-free baselines. While this holds true for both approaches on the pendulum benchmarks (balancing and swing-up), our empowered methods can also boost RL in demanding environments like Hopper, Ant and Humanoid (the latter two being amongst the most difficult MuJoCo tasks). EAC significantly improves initial learning in Ant, whereas ACIE boosts SAC in Hopper and Humanoid. While EAC outperforms PPO and DDPG in almost all tasks, it is not consistently better than SAC. Similarly, the intrinsic reward that ACIE adds to SAC does not always help. This is not unexpected, as it cannot in general be ruled out that a reward function assigns high (low) rewards to lowly (highly) empowered states, in which case the two learning signals may become partially conflicting.

Figure 2: MuJoCo Experiments. The plots show maximum episodic rewards (averaged over the last 100 episodes) achieved so far [9] versus steps; non-maximum episodic reward plots can be found in Figure 3. EAC and ACIE are compared to DDPG, PPO and SAC (DDPG did not work in Ant, see [21] and Appendix C.2 for an explanation). Shaded areas refer to the standard error. Both EAC and ACIE improve initial learning over the baselines in the three pendulum tasks (upper row). In demanding problems like Hopper, Ant and Humanoid, our methods can boost RL. In terms of final performance, EAC is competitive with the baselines: it consistently outperforms DDPG and PPO on all tasks except Hopper, but is not always better than SAC. Similarly, the ACIE signal does not always help SAC.
This is not unexpected, as extrinsic and empowered rewards may partially conflict.

For the sake of completeness, we report Figure 3, which is similar to Figure 2 but shows episodic rewards rather than the maximum episodic rewards obtained so far [9]. Also, the limits of the y-axes are preserved for the pendulum tasks. Note that our SAC baseline is comparable with the SAC from [22] on Hopper-v2, Walker2d-v2, Ant-v2 and Humanoid-v2 after 5·10⁵ steps (the SAC from [21] uses the earlier v1-versions of MuJoCo and is hence not an optimal reference). However, there is a discrepancy on HalfCheetah-v2. This was noted earlier by others who tried to reproduce SAC results on HalfCheetah-v2 but failed to obtain episodic rewards as high as in [21, 22], leading to a GitHub issue: https://github.com/rail-berkeley/softlearning/issues/75. The final conclusion of this issue was that the differences in performance are caused by different seed settings and are therefore of a statistical nature (comparing all algorithms under the same seed settings is hence valid).

[Figure 2 panels: Pendulum-v0 (swing up), InvertedPendulum-v2, InvertedDoublePendulum-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2 and Humanoid-v2; x-axes show environment steps, y-axes show episodic reward.]

Figure 3: Raw Results of MuJoCo Experiments.
The plots are similar to the plots from Figure 2, but report episodic rewards (averaged over the last 100 episodes) versus steps, not the maximum episodic rewards seen so far as in [9]. For the pendulum tasks, the limits of the y-axes are preserved.

6 Conclusion

This paper provides a theoretical contribution via a unified formulation for reward maximization and empowerment that generalizes Bellman's optimality principle and recent information-theoretic extensions to it. We proved the existence of and convergence to unique optimal values, and practically validated our ideas by devising novel parametric actor-critic algorithms inspired by our formulation. These were evaluated on the high-dimensional MuJoCo benchmark, demonstrating that empowerment can boost RL in challenging robotics tasks (e.g. Ant and Humanoid).

The most promising line of future research is to investigate scheduling schemes that dynamically trade off rewards vs. empowerment, with the prospect of obtaining better asymptotic performance. Empowerment could also be particularly useful in a multi-task setting, where task transfer could benefit from initially empowered agents.

Acknowledgments

We thank Haitham Bou-Ammar for pointing us in the direction of empowerment.

References

[1] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In Proceedings of the International Conference on Learning Representations, 2018.

[2] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.

[3] M. G. Azar, V. Gomez, and H. J. Kappen. Dynamic policy programming with function approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011.

[4] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.

[5] D. P. Bertsekas and J. N.
Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[6] W. Bialek, I. Nemenman, and N. Tishby. Predictability, complexity, and learning. Neural Computation, 13(11):2409–2463, 2001.

[7] R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.

[8] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI gym. arXiv, 2016.

[9] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

[10] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley & Sons, 2006.

[11] I. Csiszar and G. Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Suppl. 1:205–237, 1984.

[12] I. M. de Abril and R. Kanai. A unified strategy for implementing curiosity and empowerment driven reinforcement learning. arXiv, 2018.

[13] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic.
In Proceedings of the International Conference on Machine Learning, 2012.

[14] R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.

[15] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, 2018.

[16] R. G. Gallager. The Arimoto-Blahut algorithm for finding channel capacity. Technical report, Massachusetts Institute of Technology, USA, 1994.

[17] T. Genewein, F. Leibfried, J. Grau-Moya, and D. A. Braun. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2(27), 2015.

[18] J. Grau-Moya, F. Leibfried, T. Genewein, and D. A. Braun. Planning with information-processing constraints and model uncertainty in Markov decision processes. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.

[19] J. Grau-Moya, F. Leibfried, and P. Vrancx. Soft Q-learning with mutual-information regularization. In Proceedings of the International Conference on Learning Representations, 2019.

[20] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, 2017.

[21] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 2018.

[22] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. arXiv, 2019.

[23] T. Jung, D. Polani, and P. Stone.
Empowerment for continuous agent-environment systems. Adaptive Behavior, 19(1):16–39, 2011.

[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.

[25] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

[26] A. S. Klyubin, D. Polani, and C. L. Nehaniv. Empowerment: A universal agent-centric measure of control. In IEEE Congress on Evolutionary Computation, 2005.

[27] A. S. Klyubin, D. Polani, and C. L. Nehaniv. Keep your options open: An information-based driving principle for sensorimotor systems. PLoS ONE, 3(12):e4018, 2008.

[28] G. Kramer. Directed information for channels with feedback. PhD thesis, University of Manitoba, Canada, 1998.

[29] N. M. Kumar. Empowerment-driven exploration using mutual information estimation. In NIPS Workshop, 2018.

[30] F. Leibfried and D. A. Braun. A reward-maximizing spiking neuron as a bounded rational decision maker. Neural Computation, 27(8):1686–1720, 2015.

[31] F. Leibfried and D. A. Braun. Bounded rational decision-making in feedforward neural networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.

[32] F. Leibfried and J. Grau-Moya. Mutual-information regularization in Markov decision processes and actor-critic learning. In Proceedings of the Conference on Robot Learning, 2019.

[33] F. Leibfried, J. Grau-Moya, and H. Bou-Ammar. An information-theoretic optimality principle for deep reinforcement learning. In NIPS Workshop, 2018.

[34] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv, 2018.

[35] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica.
RLlib: Abstractions for distributed reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.

[36] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2016.

[37] L.-J. Lin. Reinforcement learning for robots using neural networks. PhD thesis, Carnegie Mellon University, USA, 1993.

[38] H. Marko. The bidirectional communication theory: a generalization of information theory. IEEE Transactions on Communications, 21(12):1345–1351, 1973.

[39] J. L. Massey and P. C. Massey. Conversion of mutual and directed information. In Proceedings of the International Symposium on Information Theory, 2005.

[40] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[41] V. Mnih, A. Puigdomenech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.

[42] S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2015.

[43] G. Montufar, K. Ghazi-Zahedi, and N. Ay. Information theoretically aided reinforcement learning for embodied agents. arXiv, 2016.

[44] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, 2017.

[45] G. Neu, V. Gomez, and A.
Jonsson. A unified view of entropy-regularized Markov decision processes. arXiv, 2017.

[46] P. A. Ortega and D. A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A, 469(2153), 2013.

[47] J. Pineau. Reproducible, reusable, and robust reinforcement learning. NIPS Invited Talk, 2018.

[48] M. Prokopenko, V. Gerasimov, and I. Tanev. Evolving spatiotemporal coordination in a modular robotic system. In Proceedings of the International Conference on the Simulation of Adaptive Behavior, 2006.

[49] A. H. Qureshi, B. Boots, and M. C. Yip. Adversarial imitation via variational inverse reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2019.

[50] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, 2014.

[51] J. Rubin, O. Shamir, and N. Tishby. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, chapter 3. Springer, 2012.

[52] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.

[53] C. Salge, C. Glackin, and D. Polani. Empowerment: an introduction. In Guided Self-Organization: Inception, chapter 4. Springer, 2014.

[54] J. Schossau, C. Adami, and A. Hintze. Information-theoretic neuro-correlates boost evolution of cognitive systems. Entropy, 18(1):6, 2016.

[55] J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-learning. arXiv, 2017.

[56] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 2015.

[57] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
Proximal policy optimization algorithms. arXiv, 2017.

[58] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.

[59] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. Institute of Radio Engineers, International Convention Record, 7:142–163, 1959.

[60] C. A. Sims. Implications of rational inattention. Journal of Monetary Economics, 50(3):665–690, 2003.

[61] S. Still and D. Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148, 2012.

[62] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[63] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.

[64] S. Tiomkin and N. Tishby. A unified Bellman equation for causal information and value in Markov decision processes. arXiv, 2018.

[65] N. Tishby and D. Polani. Information theory of decisions and actions. In Perception-Action Cycle, chapter 19. Springer, 2011.

[66] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[67] H. van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.

[68] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

[69] K. Zahedi, N. Ay, and R. Der. Higher coordination with less control: A result of information maximization in the sensorimotor loop. Adaptive Behavior, 18(3-4):338–355, 2010.

[70] B. D. Ziebart.
Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, USA, 2010.