{"title": "Maximum Causal Tsallis Entropy Imitation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4403, "page_last": 4413, "abstract": "In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. \nThe proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning\n(MCTEIL) algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization while Boltzmann Gibbs entropy is less effective. We validate the proposed method in two simulation studies and MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies.", "full_text": "Maximum Causal Tsallis Entropy Imitation Learning\n\nKyungjae Lee1, Sungjoon Choi2, and Songhwai Oh1\n\nDep. of Electrical and Computer Engineering and ASRI, Seoul National University1\n\nkyungjae.lee@rllab.snu.ac.kr, sam.choi@kakaobrain.com,\n\nKakao Brain2\n\nsonghwai@snu.ac.kr\n\nAbstract\n\nIn this paper, we propose a novel maximum causal Tsallis entropy (MCTE) frame-\nwork for imitation learning which can ef\ufb01ciently learn a sparse multi-modal policy\ndistribution from demonstrations. We provide the full mathematical analysis of the\nproposed framework. 
First, the optimal solution of an MCTE problem is shown to\nbe a sparsemax distribution, whose supporting set can be adjusted. The proposed\nmethod has advantages over a softmax distribution in that it can exclude unnec-\nessary actions by assigning zero probability. Second, we prove that an MCTE\nproblem is equivalent to robust Bayes estimation in the sense of the Brier score.\nThird, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL)\nalgorithm with a sparse mixture density network (sparse MDN) by modeling mix-\nture weights using a sparsemax distribution. In particular, we show that the causal\nTsallis entropy of an MDN encourages exploration and ef\ufb01cient mixture utilization\nwhile Shannon entropy is less effective.\n\n1\n\nIntroduction\n\nIn this paper, we focus on the problem of imitating demonstrations of an expert who behaves non-\ndeterministically depending on the situation. In imitation learning, it is often assumed that the expert\u2019s\npolicy is deterministic. However, there are instances, especially for complex tasks, where multiple\naction sequences perform the same task equally well. We can model such nondeterministic behavior\nof an expert using a stochastic policy. For example, expert drivers normally show consistent behaviors\nsuch as keeping lane or keeping the distance from a frontal car, but sometimes they show different\nactions for the same situation, such as overtaking a car and turning left or right at an intersection,\nas suggested in [1]. Furthermore, learning multiple optimal action sequences to perform a task is\ndesirable in terms of robustness since an agent can easily recover from failure due to unexpected\nevents [2, 3]. In addition, a stochastic policy promotes exploration and stability during learning\n[4, 2, 5]. 
Hence, modeling experts\u2019 stochasticity can be a key factor in imitation learning.\nTo this end, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation\nlearning, which can learn from a uni-modal to multi-modal policy distribution by adjusting its\nsupporting set. We \ufb01rst show that the optimal policy under the MCTE framework follows a sparsemax\ndistribution [6], which has an adaptable supporting set in a discrete action space. Traditionally, the\nmaximum causal entropy (MCE) framework [1, 7] has been proposed to model stochastic behavior in\ndemonstrations, where the optimal policy follows a softmax distribution. However, it often assigns\nnon-negligible probability mass to non-expert actions when the number of actions increases [3, 8].\nOn the contrary, as the optimal policy of the proposed method can adjust its supporting set, it can\nmodel various expert\u2019s behavior from a uni-modal distribution to a multi-modal distribution.\nTo apply the MCTE framework to a complex and model-free problem, we propose a maximum causal\nTsallis entropy imitation learning (MCTEIL) with a sparse mixture density network (sparse MDN)\nwhose mixture weights are modeled as a sparsemax distribution. By modeling expert\u2019s behavior\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fusing a sparse MDN, MCTEIL can learn varying stochasticity depending on the state in a continuous\naction space. Furthermore, we show that the MCTEIL algorithm can be obtained by extending the\nMCTE framework to the generative adversarial setting, similarly to generative adversarial imitation\nlearning (GAIL) by Ho and Ermon [9], which is based on the MCE framework. 
The main benefit of the generative adversarial setting is that the resulting policy distribution is more robust than that of a supervised learning method since it can learn recovery behaviors from less demonstrated regions to demonstrated regions by exploring the state-action space during training. Interestingly, we also show that the Tsallis entropy of a sparse MDN has an analytic form and is proportional to the distance between mixture means. Hence, maximizing the Tsallis entropy of a sparse MDN encourages exploration by providing bonus rewards to wide-spread mixture means and penalizing collapsed mixture means, while the causal entropy [1] of an MDN is less effective in terms of preventing the collapse of mixture means since it has no analytic form and an approximation is used in practice instead. Consequently, maximizing the Tsallis entropy of a sparse MDN has a clear benefit over the causal entropy in terms of exploration and mixture utilization.
To validate the effectiveness of the proposed method, we conduct two simulation studies. In the first simulation study, we verify that MCTEIL with a sparse MDN can successfully learn multi-modal behaviors from expert demonstrations. A sparse MDN efficiently learns a multi-modal policy without performance loss, while a single Gaussian and a softmax-based MDN suffer from performance loss. The second simulation study is conducted using four continuous control problems in MuJoCo [10]. MCTEIL outperforms existing methods in terms of the average cumulative return. In particular, MCTEIL shows the best performance on the reacher problem with a smaller number of demonstrations, while GAIL often fails to learn the task.

2 Related Work

Early research on IRL [1, 11–18] can be categorized into two groups: margin-based and entropy-based methods. A margin-based method maximizes the margin between the value of the expert's policy and that of all other policies [11, 12].
In [11], Abbeel and Ng proposed apprenticeship learning, where the reward function is estimated to maximize the margin between the expert's policy and randomly sampled policies. In [12], Ratliff et al. proposed maximum margin planning (MMP), where Bellman-flow constraints are introduced to consider the margin between the expert's policy and all other possible policies. On the contrary, an entropy-based method was first proposed in [1] to handle the stochastic behavior of the expert. Ziebart et al. [1] proposed maximum entropy inverse reinforcement learning (MaxEnt IRL), using the principle of maximum (Shannon) entropy to handle the ambiguity issues of IRL. Ramachandran et al. [13] proposed Bayesian inverse reinforcement learning (BIRL), where a Bayesian probabilistic model over demonstrations is introduced and the expert policy and rewards are inferred using a Metropolis-Hastings (MH) method. In [1, 13], the expert behavior is modeled as a softmax distribution over an action value, which is the optimal solution of the maximum entropy problem. We also note that [14–18] are variants based on [1, 13].
In [9], Ho and Ermon extended [1] to a unified framework covering both groups by adding a reward regularization. Most existing IRL methods can be interpreted as instances of this unified framework with different reward regularizations. These methods, including the aforementioned algorithms [1, 11–18], require solving an MDP at every iteration to update the reward function. In the model-free case, a reinforcement learning (RL) method must be applied to solve the MDP, which leads to high computational costs and large amounts of samples.
To address this issue, Ho and Ermon proposed the generative adversarial imitation learning (GAIL) method, where the policy is updated to maximize the reward function and the reward function is updated to assign high values to the expert's demonstrations and low values to the trained policy's demonstrations. GAIL achieves sample efficiency by avoiding the need to solve RL as a subroutine, alternately updating the policy and reward functions.
Recently, several variants of GAIL [19–21] have been developed based on the maximum entropy framework. These methods [19–21] focus on handling the multi-modality in demonstrations by learning a latent structure. In [19], Hausman et al. proposed an imitation learning method that learns policies from unlabeled demonstrations collected from multiple different tasks, where a latent intention is introduced in order to separate the mixed demonstrations. Similarly, in [20], a robust imitation learning method is proposed, which separates unlabeled demonstrations by assigning a latent code using a variational autoencoder. The encoding network assigns the latent code to the input demonstration. Then, the policy network is trained to mimic the input demonstration given the latent code and the encoding network is trained to recover the given latent code from the generated trajectory. In [21], a latent code is also proposed to handle multi-modal demonstrations. The latent structure in [21] is learned by maximizing a lower bound on the mutual information between the latent code and the corresponding demonstrations.
Consequently, existing imitation learning methods that can handle multi-modal behavior have a common feature in that they are developed based on the maximum entropy framework and capture the multi-modality of demonstrations by learning a mapping from demonstrations to the latent space.
Unlike recent methods for multi-modal demonstrations, the proposed method is established on the maximum causal Tsallis entropy framework, which induces a sparse distribution whose supporting set can be adjusted, instead of the original maximum entropy. Furthermore, the policy is modeled as a sparse mixture density network (sparse MDN), which can learn multi-modal behavior directly instead of learning a latent structure.

3 Background

Markov Decision Processes Markov decision processes (MDPs) are a well-known mathematical framework for sequential decision making problems. A general MDP is defined as a tuple {S, F, A, φ, Π, d, T, γ, r}, where S is the state space, F is the corresponding feature space, A is the action space, φ is a feature map from S × A to F, Π is the set of stochastic policies, i.e., Π = {π | ∀s ∈ S, a ∈ A, π(a|s) ≥ 0 and Σ_{a'} π(a'|s) = 1}, d(s) is the initial state distribution, T(s'|s, a) is the transition probability from s ∈ S to s' ∈ S by taking a ∈ A, γ ∈ (0, 1) is a discount factor, and r is the reward function from a state-action pair to a real value. In general, the goal of an MDP is to find an optimal policy distribution π* ∈ Π which maximizes the expected discounted sum of rewards, i.e., E_π[r(s, a)] ≜ E[Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, d]. Note that, for any function f(s, a), E[Σ_{t=0}^∞ γ^t f(s_t, a_t) | π, d] will be denoted as E_π[f(s, a)].

Maximum Causal Entropy Inverse Reinforcement Learning Ziebart et al.
[1] proposed the maximum causal entropy framework, which is also known as maximum entropy inverse reinforcement learning (MaxEnt IRL). MaxEnt IRL maximizes the causal entropy of a policy distribution while the feature expectation of the optimized policy distribution is matched with that of the expert's policy. The maximum causal entropy framework is defined as follows:

maximize_{π∈Π} αH(π)  subject to E_π[φ(s, a)] = E_{π_E}[φ(s, a)],   (1)

where H(π) ≜ E_π[−log(π(a|s))] is the causal entropy of policy π, α is a scale parameter, and π_E is the policy distribution of the expert. Maximum causal entropy estimation finds the most uniformly distributed policy satisfying the feature matching constraints. The feature expectation of the expert policy is used as a statistic to represent the behavior of an expert and is approximated from the expert's demonstrations D = {ζ_0, · · · , ζ_N}, where N is the number of demonstrations and ζ_i is a sequence of state and action pairs of length T, i.e., ζ_i = {(s_0, a_0), · · · , (s_T, a_T)}. In [22], it is shown that the optimal solution of (1) is a softmax distribution.

Generative Adversarial Imitation Learning In [9], Ho and Ermon have extended (1) to a unified framework for IRL by adding a reward regularization as follows:

max_c min_{π∈Π} −αH(π) + E_π[c(s, a)] − E_{π_E}[c(s, a)] − ψ(c),   (2)

where c is a cost function and ψ is a convex regularization for cost c. As shown in [9], many existing IRL methods can be interpreted within this framework, such as MaxEnt IRL [1], apprenticeship learning [11], and multiplicative weights apprenticeship learning [23].
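For reference, the softmax policy that solves (1) can be sketched numerically. The action values `q` and scale `alpha` below are hypothetical, not from the paper; the sketch simply shows that a softmax distribution assigns strictly positive mass to every action, including clearly non-expert ones.

```python
import numpy as np

def softmax_policy(q, alpha=1.0):
    """Softmax distribution over action values q, the optimal form of (1).

    Every action receives strictly positive probability, which is why a
    softmax policy cannot exclude non-expert actions.
    """
    z = q / alpha
    z = z - z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def causal_entropy_term(p):
    """Shannon entropy -sum_a p(a) log p(a) of one conditional distribution,
    i.e., the per-state term inside H(pi)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical action values: even a clearly poor third action keeps mass.
q = np.array([2.0, 1.9, -5.0])
p = softmax_policy(q)
assert np.all(p > 0)             # softmax never assigns exactly zero
```

This contrasts with the sparsemax policies discussed later, whose supporting set can shrink to exclude such actions.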
Existing IRL methods based on (2) often require solving the inner minimization over π for fixed c in order to compute the gradient of c. In [22], Ziebart showed that the inner minimization is equivalent to a soft Markov decision process (soft MDP) under the reward −c and proposed soft value iteration to solve the soft MDP. However, solving a soft MDP at every iteration is often intractable for problems with large state and action spaces and also requires the transition probability, which is not accessible in many cases. To address this issue, the generative adversarial imitation learning (GAIL) framework is proposed in [9] to avoid solving the soft MDP problem directly. The unified imitation learning problem (2) can be converted into the GAIL framework as follows:

min_{π∈Π} max_D E_π[log(D(s, a))] + E_{π_E}[log(1 − D(s, a))] − αH(π),   (3)

where D ∈ (0, 1)^{|S||A|} indicates a discriminator, which returns the probability that a given demonstration is from the learner, i.e., 1 for the learner's demonstrations and 0 for the expert's demonstrations. Notice that we can interpret log(D) as the cost c (or a reward of −c).
Since existing IRL methods, including GAIL, are often based on the maximum causal entropy, they model the expert's policy using a softmax distribution, which can assign non-zero probability to non-expert actions in a discrete action space. Furthermore, in a continuous action space, the expert's behavior is often modeled using a uni-modal Gaussian distribution, which is not proper for modeling multi-modal behaviors.
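The inner maximization over D in (3) is a binary logistic discrimination problem. The following is a minimal numerical sketch, assuming a linear-logistic discriminator and hypothetical one-dimensional features (neither is part of the original formulation): a discriminator aligned with the learner/expert separation attains a higher value of the objective than an uninformative one.

```python
import numpy as np

def discriminator(x, w, b):
    """Logistic discriminator D(x) = sigmoid(w.x + b): the probability that
    a state-action feature x came from the learner (1) vs. the expert (0)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def gail_objective(x_pi, x_exp, w, b):
    """Empirical E_pi[log D] + E_piE[log(1 - D)], the inner maximization of (3)."""
    d_pi = discriminator(x_pi, w, b)
    d_exp = discriminator(x_exp, w, b)
    return np.mean(np.log(d_pi)) + np.mean(np.log(1.0 - d_exp))

# Hypothetical 1-D features: learner samples near +1, expert samples near -1.
rng = np.random.default_rng(0)
x_pi = rng.normal(+1.0, 0.1, size=(100, 1))
x_exp = rng.normal(-1.0, 0.1, size=(100, 1))

# A discriminator aligned with the separation scores higher than a blind one.
aligned = gail_objective(x_pi, x_exp, np.array([4.0]), 0.0)
blind = gail_objective(x_pi, x_exp, np.array([0.0]), 0.0)
assert aligned > blind
```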
To handle these issues, we propose a sparsemax distribution as the policy of an expert and provide a natural extension to handle a continuous action space using a mixture density network with sparsemax weight selection.

Sparse Markov Decision Processes In [3], a sparse Markov decision process (sparse MDP) is proposed by adding the causal sparse Tsallis entropy W(π) ≜ (1/2) E_π[1 − π(a|s)] to the expected discounted reward sum, i.e., E_π[r(s, a)] + αW(π). Note that W(π) is an extension of a special case of the generalized Tsallis entropy, i.e., S_{k,q}(p) = k/(q−1) (1 − Σ_i p_i^q) for k = 1/2, q = 2, to sequential random variables.¹ It is shown that the optimal policy of a sparse MDP is a sparse and multi-modal policy distribution [3]. Furthermore, the sparse Bellman optimality conditions were derived as follows:

Q(s, a) ≜ r(s, a) + γ Σ_{s'} V(s') T(s'|s, a),
π(a|s) = max(Q(s, a)/α − τ(Q(s, ·)/α), 0),
V(s) = α [ (1/2) Σ_{a∈S(s)} ((Q(s, a)/α)² − τ(Q(s, ·)/α)²) + 1/2 ],   (4)

where τ(Q(s, ·)/α) = (Σ_{a∈S(s)} Q(s, a)/α − 1)/K_s, S(s) is the set of actions satisfying 1 + i Q(s, a_(i))/α > Σ_{j=1}^{i} Q(s, a_(j))/α with a_(i) indicating the action with the ith largest state-action value Q(s, a), and K_s is the cardinality of S(s). In [3], a sparsemax policy shows better performance than a softmax policy since it assigns zero probability to non-optimal actions whose state-action value is below the threshold τ.
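The sparsemax policy in (4) can be computed directly from the thresholding rule. Below is a minimal sketch of the sparsemax computation of Martins and Astudillo [6]; the action values `q` and scale `alpha` are hypothetical illustration inputs.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax distribution: Euclidean projection of z onto the probability
    simplex, p = max(z - tau, 0), with tau chosen as in Eq. (4)."""
    z_sorted = np.sort(z)[::-1]            # sort values in decreasing order
    cssv = np.cumsum(z_sorted)             # cumulative sums of sorted values
    ks = np.arange(1, len(z) + 1)
    # Supporting set size: the largest k with 1 + k * z_(k) > sum of top-k.
    k = ks[1.0 + ks * z_sorted > cssv][-1]
    tau = (cssv[k - 1] - 1.0) / k          # threshold tau
    return np.maximum(z - tau, 0.0)

# Hypothetical action values q scaled by alpha; the weak action is cut off.
q, alpha = np.array([2.0, 1.9, -5.0]), 1.0
p = sparsemax(q / alpha)
assert p[2] == 0.0                  # exactly zero mass on the poor action
assert abs(p.sum() - 1.0) < 1e-9
```

Unlike the softmax distribution, the supporting set here adapts: near-tied actions share the mass while clearly inferior actions are excluded.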
In this paper, we utilize this property in imitation learning by modeling the expert's behavior using a sparsemax distribution. In Section 4, we show that the optimal solution of an MCTE problem also has a sparsemax distribution and, hence, the optimality condition of sparse MDPs is closely related to that of MCTE problems.

4 Principle of Maximum Causal Tsallis Entropy

In this section, we formulate maximum causal Tsallis entropy imitation learning (MCTEIL) and show that MCTE induces a sparse and multi-modal distribution which has an adaptable supporting set. The problem of maximizing the causal Tsallis entropy W(π) can be formulated as follows:

maximize_{π∈Π} αW(π)  subject to E_π[φ(s, a)] = E_{π_E}[φ(s, a)].   (5)

In order to derive optimality conditions, we will first change the optimization variable from a policy distribution to a state-action visitation measure. Then, we prove that the MCTE problem is concave with respect to the visitation measure. The necessary and sufficient conditions for an optimal solution are derived from the Karush-Kuhn-Tucker (KKT) conditions using the strong duality, and the optimal policy is shown to be a sparsemax distribution. Furthermore, we also provide an interesting interpretation of the MCTE framework as robust Bayes estimation in terms of the Brier score. Hence,

¹The causal entropy is generally defined upon causally conditioned random variables.
However, in this paper, the causal Tsallis entropy is defined over random variables with the Markov property, i.e., π(a_t|s_t) = π(a_t|s_t, a_{t−1}, s_{t−1}, · · · , a_0, s_0), since we only consider an MDP.

the proposed method can be viewed as maximization of the worst case performance in the sense of the Brier score [24].
We can change the optimization variable from a policy distribution to a state-action visitation measure based on the following theorem.

Theorem 1 (Theorem 2 of Syed et al. [25]) Let M be the set of state-action visitation measures, i.e., M ≜ {ρ | ∀s, a, ρ(s, a) ≥ 0, Σ_a ρ(s, a) = d(s) + γ Σ_{s',a'} T(s|s', a')ρ(s', a')}. If ρ ∈ M, then it is a state-action visitation measure for π_ρ(a|s) ≜ ρ(s, a)/Σ_{a'} ρ(s, a'), and π_ρ is the unique policy whose state-action visitation measure is ρ.

The proof of Theorem 1 can be found in [25] or in Puterman [26]. Theorem 1 guarantees the one-to-one correspondence between a policy distribution and a state-action visitation measure. Then, the objective function W(π) is converted into a function of ρ as follows.

Theorem 2 Let W̄(ρ) = (1/2) Σ_{s,a} ρ(s, a) (1 − ρ(s, a)/Σ_{a'} ρ(s, a')). Then, for any stationary policy π ∈ Π and any state-action visitation measure ρ ∈ M, W(π) = W̄(ρ_π) and W̄(ρ) = W(π_ρ) hold.

The proof is provided in the supplementary material. Theorem 2 tells us that if W̄(ρ) has its maximum at ρ*, then W(π) also has its maximum at π_{ρ*}. Based on Theorems 1 and 2, we can freely convert the problem (5) into

maximize_{ρ∈M} αW̄(ρ)  subject to Σ_{s,a} ρ(s, a)φ(s, a) = Σ_{s,a} ρ_E(s, a)φ(s, a),   (6)

where ρ_E is the state-action visitation measure corresponding to π_E.

4.1 Optimality Condition of Maximum Causal Tsallis Entropy

We show that the optimal policy of the problem (6) is a sparsemax distribution using the KKT conditions. In order to use the KKT conditions, we first show that the MCTE problem is concave.

Theorem 3 W̄(ρ) is strictly concave with respect to ρ ∈ M.

The proof of Theorem 3 is provided in the supplementary material. Since all constraints are linear and the objective function is concave, (6) is a concave problem and, hence, strong duality holds. The dual problem is defined as follows:

max_{θ,c,λ} min_ρ L_W(θ, c, λ, ρ)  subject to ∀s, a: λ_{sa} ≥ 0,   (7)

where L_W(θ, c, λ, ρ) = −αW̄(ρ) − Σ_{s,a} ρ(s, a)θᵀφ(s, a) + Σ_{s,a} ρ_E(s, a)θᵀφ(s, a) − Σ_{s,a} λ_{sa}ρ(s, a) + Σ_s c_s (Σ_a ρ(s, a) − d(s) − γ Σ_{s',a'} T(s|s', a')ρ(s', a')), and θ, c, and λ are Lagrangian multipliers and the constraints come from M.
Then, the optimal primal and dual variables necessarily and sufficiently satisfy the KKT conditions.

Theorem 4 The optimal solution of (6) sufficiently and necessarily satisfies the following conditions:

q_{sa} ≜ θᵀφ(s, a) + γ Σ_{s'} c_{s'} T(s'|s, a),
π_ρ(a|s) = max(q_{sa}/α − τ(q_s/α), 0),
c_s = α [ (1/2) Σ_{a∈S(s)} ((q_{sa}/α)² − τ(q_s/α)²) + 1/2 ],

where π_ρ(a|s) = ρ(s, a)/Σ_a ρ(s, a), q_{sa} is an auxiliary variable, and q_s = [q_{sa_1} · · · q_{sa_|A|}]ᵀ.

The optimality conditions of the problem (6) tell us that the optimal policy is a sparsemax distribution which assigns zero probability to an action whose auxiliary variable q_{sa} is below the threshold τ, which determines a supporting set.

Algorithm 1 Maximum Causal Tsallis Entropy Imitation Learning
1: Expert's demonstrations D are given
2: Initialize policy and discriminator parameters ν, ω
3: while until convergence do
4:   Sample trajectories {ζ} from π_ν
5:   Update ω with the gradient of Σ_{ζ} log(D_ω(s, a)) + Σ_D log(1 − D_ω(s, a))
6:   Update ν using a policy optimization method with the reward function −E_{π_ν}[log(D_ω(s, a))] + αW(π_ν)
7: end while
If the expert's policy is multi-modal at state s, the resulting π_ρ(·|s) becomes a multi-modal distribution with a large supporting set. Otherwise, the resulting policy has a sparse and smaller supporting set. Therefore, a sparsemax policy has advantages over a softmax policy for modeling the sparse and multi-modal behaviors of an expert whose supporting set varies according to the state.
Furthermore, we also discover an interesting connection between the optimality condition of an MCTE problem and the sparse Bellman optimality condition (4). Since the optimality condition is equivalent to the sparse Bellman optimality equation [3], we can compute the optimal policy and the Lagrangian multiplier c_s by solving a sparse MDP under the reward function r(s, a) = θ*ᵀφ(s, a), where θ* is the optimal dual variable. In addition, c_s and q_{sa} can be viewed as a state value and a state-action value for the reward θ*ᵀφ(s, a), respectively.

4.2 Interpretation as Robust Bayes

In this section, we provide an interesting interpretation of the MCTE framework. In general, maximum entropy estimation can be viewed as a minimax game between two players. One player is called the decision maker and the other is called the nature, where the nature assigns a distribution to maximize the decision maker's misprediction while the decision maker tries to minimize it [27]. The same interpretation can be applied to the MCTE framework.
We show that the proposed MCTE problem is equivalent to a minimax game with the Brier score [24].

Theorem 5 The maximum causal Tsallis entropy distribution minimizes the worst case prediction Brier score:

min_{π∈Π} max_{π̃∈Π} E_{π̃} [ Σ_{a'} (1/2)(1{a'=a} − π(a'|s))² ]  subject to E_π[φ(s, a)] = E_{π_E}[φ(s, a)],   (8)

where Σ_{a'} (1/2)(1{a'=a} − π(a'|s))² is the Brier score.

Note that minimizing the Brier score minimizes the misprediction ratio, although we call it a score here. Theorem 5 is a straightforward extension of the robust Bayes results in [27] to sequential decision problems. This theorem tells us that the MCTE problem can be viewed as a minimax game between a sequential decision maker π and the nature π̃ based on the Brier score. In this regard, the resulting estimator can be interpreted as the best decision maker against the worst that the nature can offer.

5 Maximum Causal Tsallis Entropy Imitation Learning

In this section, we propose the maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm to solve a model-free IL problem in a continuous action space. In many real-world problems, state and action spaces are often continuous and the transition probability of the world cannot be accessed. To apply the MCTE framework to a continuous and model-free case, we follow the extension of GAIL [9], which trains a policy and a reward alternately, instead of solving RL at every iteration.
We extend the MCTE framework to a more general case with reward regularization, which is formulated by replacing the causal entropy H(π) in the problem (2) with the causal Tsallis entropy W(π) as follows:

max_θ min_{π∈Π} −αW(π) − E_π[θᵀφ(s, a)] + E_{π_E}[θᵀφ(s, a)] − ψ(θ).   (9)

Similarly to [9], we convert the problem (9) into the generative adversarial setting as follows.

Theorem 6 The maximum causal sparse Tsallis entropy problem (9) is equivalent to the problem:

min_{π∈Π} ψ*(E_π[φ(s, a)] − E_{π_E}[φ(s, a)]) − αW(π),

where ψ*(x) = sup_y {yᵀx − ψ(y)}.

The proof is detailed in the supplementary material. The proof of Theorem 6 depends on the fact that the objective function of (9) is concave with respect to ρ and convex with respect to θ. Hence, we first switch the optimization variable from π to ρ and, using the minimax theorem [28], the maximization and minimization are interchangeable and the generative adversarial setting is derived. Similarly to [9], Theorem 6 says that an MCTE problem can be interpreted as minimization of the distance between the expert's feature expectation and the training policy's feature expectation, where ψ*(x_1 − x_2) is a proper distance function since ψ(x) is a convex function. Let e_{sa} ∈ R^{|S||A|} be a feature indicator vector such that the sa-th element is one and zero elsewhere.
If we set ψ to ψ_GA(θ) ≜ E_{π_E}[g(θᵀe_{sa})], where g(x) = −x − log(1 − e^x) for x < 0 and g(x) = ∞ for x ≥ 0, we can convert the MCTE problem into the following generative adversarial setting:

min_{π∈Π} max_D E_π[log(D(s, a))] + E_{π_E}[log(1 − D(s, a))] − αW(π),   (10)

where D is a discriminator. The problem (10) can be solved by MCTEIL, which consists of three steps. First, trajectories are sampled from the training policy π_ν. Second, the discriminator D_ω is updated to distinguish whether the trajectories are generated by π_ν or π_E. Finally, the training policy π_ν is updated with a policy optimization method under the sum of rewards E_π[−log(D_ω(s, a))] with the causal Tsallis entropy bonus αW(π_ν). The algorithm is summarized in Algorithm 1.

Sparse Mixture Density Network We further employ a novel mixture density network (MDN) with sparsemax weight selection, called a sparse MDN, which can model the sparse and multi-modal behavior of an expert. In many imitation learning algorithms, a Gaussian network is often employed to model the expert's policy in a continuous action space. However, a Gaussian distribution is inappropriate for modeling the multi-modality of an expert since it has a single mode. An MDN is more suitable for modeling a multi-modal distribution. In particular, a sparse MDN is a proper extension of a sparsemax distribution to a continuous action space. The input of a sparse MDN is a state s and the output of a sparse MDN consists of the components of K mixtures of Gaussians: mixture weights {w_i}, means {μ_i}, and covariance matrices {Σ_i}.
A sparse MDN policy is defined as

π(a|s) = Σ_{i=1}^{K} w_i(s) N(a; μ_i(s), Σ_i(s)),

where N(a; μ, Σ) indicates a multivariate Gaussian density at point a with mean μ and covariance Σ. In our implementation, w(s) is computed as a sparsemax distribution, while most existing MDN implementations utilize a softmax distribution. Modeling the expert's policy using an MDN with K mixtures can be interpreted as separating the continuous action space into K representative actions. Since we model the mixture weights using a sparsemax distribution, the number of mixtures used to model the expert's policy can vary depending on the state. In this regard, the sparsemax weight selection has an advantage over the soft weight selection since the former utilizes mixture components more efficiently, as unnecessary components are assigned zero weights.

Tsallis Entropy of Mixture Density Network An interesting fact is that the causal Tsallis entropy of an MDN has an analytic form, while the Gibbs-Shannon entropy of an MDN is intractable.

Theorem 7 Let π(a|s) = Σ_{i=1}^{K} w_i(s) N(a; μ_i(s), Σ_i(s)) and ρ_π(s) = Σ_a ρ_π(s, a). Then,

W(π) = (1/2) Σ_s ρ_π(s) (1 − Σ_{i=1}^{K} Σ_{j=1}^{K} w_i(s) w_j(s) N(μ_i(s); μ_j(s), Σ_i(s) + Σ_j(s))).   (11)

The proof is included in the supplementary material. The analytic form of the Tsallis entropy shows that the Tsallis entropy is proportional to the distance between mixture means. Hence, maximizing the Tsallis entropy of a sparse MDN encourages exploration of diverse directions during the policy optimization step of MCTEIL.
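The per-state expression in (11) can be checked numerically for a one-dimensional action space. The sketch below, with hypothetical two-component mixtures (equal weights, unit variances), illustrates the claim that spread-out mixture means receive a larger entropy bonus than collapsed ones.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mdn_tsallis_entropy(w, mu, var):
    """Per-state Tsallis entropy of a 1-D Gaussian mixture, Eq. (11):
    (1/2) (1 - sum_ij w_i w_j N(mu_i; mu_j, var_i + var_j))."""
    overlap = sum(
        w[i] * w[j] * gauss_pdf(mu[i], mu[j], var[i] + var[j])
        for i in range(len(w)) for j in range(len(w))
    )
    return 0.5 * (1.0 - overlap)

# Spreading the means apart increases the entropy bonus; collapsed means
# (a single mode modeled by two components) are penalized.
w, var = np.array([0.5, 0.5]), np.array([1.0, 1.0])
spread = mdn_tsallis_entropy(w, np.array([-3.0, 3.0]), var)
collapsed = mdn_tsallis_entropy(w, np.array([0.0, 0.0]), var)
assert spread > collapsed
```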
In imitation learning, the main benefit of the generative adversarial setting is that the resulting policy is more robust than that of supervised learning, since the agent can learn how to recover from a less demonstrated region to a demonstrated region by exploring the state-action space during training. The maximum Tsallis entropy of a sparse MDN encourages efficient exploration by giving bonus rewards when mixture means are spread out. (11) also has the effect of utilizing mixtures more efficiently, by penalizing the use of several mixtures to model a single mode. Consequently, the Tsallis entropy W(π) has clear benefits in terms of both exploration and mixture utilization.

6 Experiments

To verify the effectiveness of the proposed method, we compare MCTEIL with several other imitation learning methods. First, we use behavior cloning (BC) as a baseline. Second, generative adversarial imitation learning (GAIL) with a single Gaussian distribution is compared. We also compare a straightforward extension of GAIL to a multi-modal policy, which uses a softmax-weighted mixture density network (soft MDN), in order to validate the efficiency of the proposed sparsemax-weighted MDN. In soft GAIL, due to the intractability of the causal entropy of a mixture of Gaussians, we approximate the entropy term by adding −α log(π(a_t|s_t)) to −log(D(s_t, a_t)), since E_π[−log(D(s, a))] + αH(π) = E_π[−log(D(s, a)) − α log(π(a|s))]. We also compare info GAIL [21], which simultaneously learns both the policy and the latent structure of the experts' demonstrations.
In info GAIL, a posterior distribution over a latent code is learned to cluster multi-modal demonstrations. The posterior distribution is trained to consistently assign the same latent code to similar demonstrations. Once the latent codes are assigned to the demonstrations, the policy function conditioned on a latent code is trained to generate the corresponding demonstrations. Different modes in the demonstrations are captured by assigning different latent codes.

6.1 Multi-Goal Environment

To validate that the proposed method can learn the multi-modal behavior of an expert, we design a simple multi-goal environment with four attractors and four repulsors, where an agent tries to reach one of the attractors while avoiding all repulsors, as shown in Figure 1(a). The agent follows point-mass dynamics and gets a positive reward (resp., a negative reward) when getting closer to an attractor (resp., a repulsor). Intuitively, this problem has multi-modal optimal actions at the center. We first train the optimal policy using [3] and generate 300 demonstrations from the expert's policy. For the tested methods, 500 episodes are sampled at each iteration. In every iteration, we measure the average return using the underlying rewards and the reachability, which is measured by counting how many goals are reached. If an algorithm captures the multi-modality of the expert's demonstrations, then the resulting policy will show high reachability. All algorithms are run repeatedly with seven different random seeds.
The results are shown in Figures 1(b) and 1(c). Since the rewards are multi-modal, it is easy to obtain a high return even if an algorithm learns only uni-modal behavior. Hence, the average returns of soft GAIL, info GAIL, and MCTEIL increase similarly. However, when it comes to reachability, MCTEIL outperforms the other methods when they use the same number of mixtures.
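A reachability metric of this kind can be sketched as follows (the goal layout and success radius are illustrative assumptions, not taken from the paper): count how many of the four goals are reached by at least one rollout.

```python
import numpy as np

# Hypothetical goal layout for a four-attractor task: one goal on each axis.
GOALS = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

def reachability(final_states, radius=0.2):
    """Fraction of goals reached by at least one rollout, where a goal counts
    as reached if some final state lies within `radius` of it."""
    final_states = np.asarray(final_states, dtype=float)
    # Pairwise distances: (num_rollouts, num_goals)
    dists = np.linalg.norm(final_states[:, None, :] - GOALS[None, :, :], axis=-1)
    reached = (dists < radius).any(axis=0)   # per-goal indicator
    return reached.sum() / len(GOALS)

# A uni-modal policy that only ever reaches one goal scores 0.25;
# a multi-modal policy covering all four goals scores 1.0.
uni   = np.array([[0.98, 0.01]] * 10)
multi = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(reachability(uni), reachability(multi))  # 0.25 1.0
```

Under this metric, a mode-collapsed policy is capped at a low score no matter how high its return, which is why reachability separates the methods more sharply than average return.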
In particular, MCTEIL can learn all modes in the demonstrations by the end of learning, while soft GAIL and info GAIL suffer from mode collapsing. This advantage clearly comes from the maximum Tsallis entropy of a sparse MDN, since the analytic form of the Tsallis entropy directly penalizes collapsed mixture means, whereas −log(π(a|s)) only indirectly prevents mode collapsing in soft GAIL. Furthermore, info GAIL also shows mode collapsing while the proposed method can learn every mode. Since info GAIL has to train a posterior distribution over the latent code to separate demonstrations, it requires more iterations to reach all modes and is prone to the mode collapsing problem. Consequently, we can conclude that MCTEIL efficiently utilizes each mixture for wide-spread exploration.

6.2 Continuous Control Environment

We test MCTEIL with a sparse MDN on MuJoCo [10], a physics-based simulator, using the HalfCheetah, Walker2d, Reacher, and Ant tasks. We train the expert policy distribution using trust region policy optimization (TRPO) [29] under the true reward function and generate 50 demonstrations from the expert policy. We run the algorithms with varying numbers of demonstrations, 4, 11, 18, and 25, and all experiments have been repeated three times with different random seeds. To evaluate the performance of each algorithm, we sample 50 episodes from the trained policy and measure the

Figure 1: (a) The environment and multi-modal demonstrations are shown. The contour shows the underlying reward map. (b) The average return during training. (c) The reachability during training, where k is the number of mixtures, c is the dimension of the latent code, and α is a regularization coefficient.

Figure 2: Average returns of trained policies.
For soft GAIL and MCTEIL, k indicates the number of mixtures and α is an entropy regularization coefficient. A dashed line indicates the performance of the expert.

average return value using the underlying rewards. For the methods using an MDN, we select the best number of mixtures using a brute-force search.
The results are shown in Figure 2. For the three problems other than Walker2d, MCTEIL outperforms the other methods with respect to the average return as the number of demonstrations increases. For Walker2d, MCTEIL and soft GAIL show similar performance. Notably, in the Reacher problem, we obtain results similar to those reported in [9], where BC works better than GAIL; however, our method shows the best performance for all demonstration counts. It is observed that an MDN policy tends to show consistently high performance, since MCTEIL and soft GAIL are consistently ranked as the top two performing algorithms. From these results, we can conclude that an MDN policy explores better than a single Gaussian policy, since an MDN can keep searching in multiple directions during training. In particular, since the maximum Tsallis entropy makes each mixture mean explore in a different direction and a sparsemax distribution assigns zero weight to unnecessary mixture components, MCTEIL explores efficiently and shows better performance than soft GAIL with a soft MDN. Consequently, we can conclude that MCTEIL outperforms the other imitation learning methods and that the causal Tsallis entropy has benefits over the causal Gibbs-Shannon entropy, as it encourages exploration more efficiently.

7 Conclusion

In this paper, we have proposed a novel maximum causal Tsallis entropy (MCTE) framework, which induces a sparsemax distribution as the optimal solution. We have also provided a full mathematical analysis of the proposed framework, including the concavity of the problem, the optimality condition, and the interpretation as robust Bayes estimation.
We have also developed the maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm, which can efficiently solve an MCTE problem in a continuous action space, since the Tsallis entropy of a mixture of Gaussians encourages exploration and efficient mixture utilization. In experiments, we have verified that the proposed method has advantages over existing methods for learning the multi-modal behavior of an expert, since a sparse MDN can search in diverse directions efficiently. Furthermore, the proposed method has outperformed BC, GAIL, and GAIL with a soft MDN on standard imitation learning problems in the MuJoCo environment. From the analysis and experiments, we have shown that the proposed MCTEIL method is an efficient and principled way to learn the multi-modal behavior of an expert.

Acknowledgments

This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017R1A2B2006136) and by the Brain Korea 21 Plus Project in 2018.

References

[1] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, July 2008, pp. 1433–1438.

[2] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proceedings of the 34th International Conference on Machine Learning, August 2017, pp. 1352–1361.

[3] K. Lee, S. Choi, and S. Oh, "Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1466–1473, 2018.

[4] N. Heess, D. Silver, and Y. W.
Teh, "Actor-critic reinforcement learning with energy-based policies," in Proceedings of the 10th European Workshop on Reinforcement Learning, June 2012, pp. 43–58.

[5] P. Vamplew, R. Dazeley, and C. Foale, "Softmax exploration strategies for multiobjective reinforcement learning," Neurocomputing, vol. 263, pp. 74–86, June 2017.

[6] A. F. T. Martins and R. F. Astudillo, "From softmax to sparsemax: A sparse model of attention and multi-label classification," in Proceedings of the 33rd International Conference on Machine Learning, June 2016, pp. 1614–1623.

[7] M. Bloem and N. Bambos, "Infinite time horizon maximum causal entropy inverse reinforcement learning," in Proceedings of the 53rd International Conference on Decision and Control, December 2014, pp. 4911–4916.

[8] Y. Chow, O. Nachum, and M. Ghavamzadeh, "Path consistency learning in Tsallis entropy regularized MDPs," in Proceedings of the International Conference on Machine Learning, July 2018, pp. 978–987.

[9] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, December 2016, pp. 4565–4573.

[10] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in Proceedings of the International Conference on Intelligent Robots and Systems, October 2012, pp. 5026–5033.

[11] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proceedings of the 21st International Conference on Machine Learning, July 2004.

[12] N. D. Ratliff, J. A. Bagnell, and M. Zinkevich, "Maximum margin planning," in Proceedings of the 23rd International Conference on Machine Learning, June 2006.

[13] D. Ramachandran and E.
Amir, "Bayesian inverse reinforcement learning," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, January 2007.

[14] S. Levine, Z. Popovic, and V. Koltun, "Nonlinear inverse reinforcement learning with Gaussian processes," in Advances in Neural Information Processing Systems, 2011, pp. 19–27.

[15] J. Zheng, S. Liu, and L. M. Ni, "Robust Bayesian inverse reinforcement learning with sparse behavior noise," in Proceedings of the 28th AAAI Conference on Artificial Intelligence, July 2014.

[16] J. Choi and K.-E. Kim, "Hierarchical Bayesian inverse reinforcement learning," IEEE Transactions on Cybernetics, vol. 45, no. 4, pp. 793–805, 2015.

[17] J. Choi and K. Kim, "Bayesian nonparametric feature construction for inverse reinforcement learning," in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, August 2013.

[18] M. Wulfmeier, P. Ondruska, and I. Posner, "Maximum entropy deep inverse reinforcement learning," arXiv preprint arXiv:1507.04888, 2015.

[19] K. Hausman, Y. Chebotar, S. Schaal, G. S. Sukhatme, and J. J. Lim, "Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets," in Advances in Neural Information Processing Systems, December 2017, pp. 1235–1245.

[20] Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess, "Robust imitation of diverse behaviors," in Advances in Neural Information Processing Systems, December 2017, pp. 5326–5335.

[21] Y. Li, J. Song, and S. Ermon, "InfoGAIL: Interpretable imitation learning from visual demonstrations," in Advances in Neural Information Processing Systems, December 2017, pp. 3815–3825.

[22] B. D.
Ziebart, "Modeling purposeful adaptive behavior with the principle of maximum causal entropy," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, USA, 2010, AAI3438449.

[23] U. Syed and R. E. Schapire, "A game-theoretic approach to apprenticeship learning," in Advances in Neural Information Processing Systems, December 2007, pp. 1449–1456.

[24] G. W. Brier, "Verification of forecasts expressed in terms of probability," Monthly Weather Review, vol. 78, no. 1, pp. 1–3, 1950.

[25] U. Syed, M. Bowling, and R. E. Schapire, "Apprenticeship learning using linear programming," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1032–1039.

[26] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[27] P. D. Grünwald and A. P. Dawid, "Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory," Annals of Statistics, pp. 1367–1433, 2004.

[28] P. W. Millar, "The minimax principle in asymptotic statistical theory," in École d'Été de Probabilités de Saint-Flour XI—1981. Springer, 1983, pp. 75–265.

[29] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the 32nd International Conference on Machine Learning, July 2015, pp. 1889–1897.