{"title": "Regret Minimization in MDPs with Options without Prior Knowledge", "book": "Advances in Neural Information Processing Systems", "page_first": 3166, "page_last": 3176, "abstract": "The option framework integrates temporal abstraction into the reinforcement learning model through the introduction of macro-actions (i.e., options). Recent works leveraged on the mapping of Markov decision processes (MDPs) with options to semi-MDPs (SMDPs) and introduced SMDP-versions of exploration-exploitation algorithms (e.g., RMAX-SMDP and UCRL-SMDP) to analyze the impact of options on the learning performance. Nonetheless, the PAC-SMDP sample complexity of RMAX-SMDP can hardly be translated into equivalent PAC-MDP theoretical guarantees, while UCRL-SMDP requires prior knowledge of the parameters characterizing the distributions of the cumulative reward and duration of each option, which are hardly available in practice. In this paper, we remove this limitation by combining the SMDP view together with the inner Markov structure of options into a novel algorithm whose regret performance matches UCRL-SMDP's up to an additive regret term. We show scenarios where this term is negligible and the advantage of temporal abstraction is preserved. We also report preliminary empirical result supporting the theoretical findings.", "full_text": "Regret Minimization in MDPs with Options\n\nwithout Prior Knowledge\n\nRonan Fruit\n\nSequel Team - Inria Lille\nronan.fruit@inria.fr\n\nMatteo Pirotta\n\nSequel Team - Inria Lille\n\nmatteo.pirotta@inria.fr\n\nAlessandro Lazaric\n\nSequel Team - Inria Lille\n\nalessandro.lazaric@inria.fr\n\nEmma Brunskill\nStanford University\n\nebrun@cs.stanford.edu\n\nAbstract\n\nThe option framework integrates temporal abstraction into the reinforcement learn-\ning model through the introduction of macro-actions (i.e., options). 
Recent works\nleveraged the mapping of Markov decision processes (MDPs) with options to\nsemi-MDPs (SMDPs) and introduced SMDP-versions of exploration-exploitation\nalgorithms (e.g., RMAX-SMDP and UCRL-SMDP) to analyze the impact of options\non the learning performance. Nonetheless, the PAC-SMDP sample complexity\nof RMAX-SMDP can hardly be translated into equivalent PAC-MDP theoretical\nguarantees, while the regret analysis of UCRL-SMDP requires prior knowledge of\nthe distributions of the cumulative reward and duration of each option, which are\nhardly available in practice. In this paper, we remove this limitation by combining\nthe SMDP view together with the inner Markov structure of options into a novel\nalgorithm whose regret performance matches UCRL-SMDP\u2019s up to an additive\nregret term. We show scenarios where this term is negligible and the advantage\nof temporal abstraction is preserved. We also report preliminary empirical results\nsupporting the theoretical \ufb01ndings.\n\n1\n\nIntroduction\n\nTractable learning of how to make good decisions in complex domains over many time steps almost\nde\ufb01nitely requires some form of hierarchical reasoning. One powerful and popular framework for\nincorporating temporally-extended actions in the context of reinforcement learning is the options\nframework [1]. Creating and leveraging options has been the subject of many papers over the last two\ndecades (see e.g., [2, 3, 4, 5, 6, 7, 8]) and it has been of particular interest recently in combination\nwith deep reinforcement learning, with a number of impressive empirical successes (see e.g., [9] for\nan application to Minecraft). 
Intuitively (and empirically) temporal abstraction can help speed up\nlearning (reduce the amount of experience needed to learn a good policy) by shaping the actions\nselected towards more promising sequences of actions [10], and it can reduce planning computation\nthrough reducing the need to evaluate over all possible actions (see e.g., Mann and Mannor [11]).\nHowever, incorporating options does not always improve learning ef\ufb01ciency as shown by Jong et al.\n[12]. Intuitively, limiting action selection only to temporally-extended options might hamper the\nexploration of the environment by restricting the policy space. Therefore, we argue that in addition to\nthe exciting work being done in heuristic and algorithmic approaches that leverage and/or dynamically\ndiscover options, it is important to build a formal understanding of how and when options may help\nor hurt reinforcement learning performance, and that such insights may also help inform empirically\nmotivated options-RL research.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThere has been fairly limited work on formal performance bounds of RL with options. Brunskill and\nLi [13] derived sample complexity bounds for an RMAX-like exploration-exploitation algorithm for\nsemi-Markov decision processes (SMDPs). While MDPs with options can be mapped to SMDPs,\ntheir analysis cannot be immediately translated into the PAC-MDP sample complexity of learning\nwith options, which makes it harder to evaluate their potential bene\ufb01t. Fruit and Lazaric [14] analyzed\nan SMDP variant of UCRL [15] showing how its regret can be mapped to the regret of learning in the\noriginal MDP with options. 
The resulting analysis explicitly showed how options can be beneficial\nwhenever the navigability among the states in the original MDP is not compromised (i.e., the MDP\ndiameter is not significantly increased), the level of temporal abstraction is high (i.e., options have\nlong durations, thus reducing the number of decision steps), and the optimal policy with options\nperforms as well as the optimal policy using primitive actions. While this result makes explicit the\nimpact of options on the learning performance, the proposed algorithm (UCRL-SMDP, or SUCRL\nin short) needs prior knowledge on the parameters of the distributions of cumulative rewards and\ndurations of each option to construct confidence intervals and compute optimistic solutions. In\npractice this is often a strong requirement and any incorrect parametrization (e.g., loose upper-bounds\non the true parameters) directly translates into a poorer regret performance. Furthermore, even if\na hand-designed set of options may come with accurate estimates of their parameters, this would\nnot be possible for automatically generated options, which are of increasing interest to the deep RL\ncommunity. Finally, this prior work views each option as a distinct and atomic macro-action, thus\nlosing the potential benefit of considering the inner structure and the interaction between options,\nwhich could be used to significantly improve sample efficiency.\nIn this paper we remove the limitations of prior theoretical analyses. 
In particular, we combine the semi-Markov decision process view on options and the intrinsic MDP structure underlying their execution to achieve temporal abstraction without relying on parameters that are typically unknown. We introduce a transformation mapping each option to an associated irreducible Markov chain and we show that optimistic policies can be computed using only the stationary distributions of the irreducible chains and the SMDP dynamics (i.e., state-to-state transition probabilities through options). This approach does not need to explicitly estimate cumulative rewards and durations of options and their confidence intervals. We propose two alternative implementations of a general algorithm (FREE-SUCRL, or FSUCRL in short) that differ in whether the stationary distribution of the options' irreducible Markov chains and its confidence intervals are computed explicitly or implicitly through an ad-hoc extended value iteration algorithm. We derive regret bounds for FSUCRL that match the regret of SUCRL up to an additional term accounting for the complexity of estimating the stationary distribution of an irreducible Markov chain starting from its transition matrix. This additional regret is the, possibly unavoidable, cost to pay for not having prior knowledge on options. We support the theoretical findings with a series of simple grid-world experiments where we compare FSUCRL to SUCRL and UCRL (i.e., learning without options).

2 Preliminaries

Learning in MDPs with options. A finite MDP is a tuple M = {S, A, p, r} where S is the set of states, A is the set of actions, p(s′|s, a) is the probability of transition from state s to state s′ through action a, and r(s, a) is the random reward associated to (s, a) with expectation r(s, a). A deterministic policy π : S → A maps states to actions. We define an option as a tuple o = {so, βo, πo} where so ∈ S is the state where the option can be initiated¹, πo : S → A is the associated stationary Markov policy, and βo : S → [0, 1] is the probability of termination. As proved by Sutton et al. [1], when primitive actions are replaced by a set of options O, the resulting decision process is a semi-Markov decision process (SMDP) MO = {SO, Os, pO, RO, τO} where SO ⊆ S is the set of states where options can start and end, Os is the set of options available at state s, pO(s′|s, o) is the probability of terminating in s′ when starting o from s, RO(s, o) is the (random) cumulative reward obtained by executing option o from state s until interruption at s′ with expectation RO(s, o), and τO(s, o) is the duration (i.e., number of actions executed to go from s to s′ by following πo) with expectation τ(s, o).² Throughout the rest of the paper, we assume that options are well defined.

¹Restricting the standard initial set to one state so is without loss of generality (see App. A).
²Notice that RO(s, o) (similarly for τO) is well defined only when s = so, that is when o ∈ Os.

Assumption 1. The set of options O is admissible, that is 1) all options terminate in finite time with probability 1, 2) in all possible terminal states there exists at least one option that can start, i.e., ∪o∈O {s : βo(s) > 0} ⊆ ∪o∈O {so}, 3) the resulting SMDP MO is communicating.

Lem. 3 in [14] shows that under Asm. 
1 the family of SMDPs induced by using options in MDPs is such that for any option o, the distributions of the cumulative reward and the duration are sub-Exponential with bounded parameters (σr(o), br(o)) and (στ(o), bτ(o)) respectively. The maximal expected duration is denoted by τmax = max_{s,o} {τO(s, o)}. Let t denote primitive action steps and let i index decision steps at option level. The number of decision steps up to (primitive) step t is N(t) = max{n : Tn ≤ t}, where Tn = ∑_{i=1}^n τi is the number of primitive steps executed over n decision steps and τi is the (random) number of steps before the termination of the option chosen at step i. Under Asm. 1 there exists a policy π* : S → O over options that achieves the largest gain (per-step reward)

ρ*O := max_π ρ^π_O = max_π lim_{t→+∞} E_π[ ( ∑_{i=1}^{N(t)} Ri ) / t ],   (1)

where Ri is the reward cumulated by the option executed at step i. The optimal gain also satisfies the optimality equation of an equivalent MDP obtained by data-transformation (Lem. 2 in [16]), i.e.,

∀s ∈ S:   ρ*O = max_{o∈Os} { RO(s, o)/τO(s, o) + (1/τO(s, o)) ( ∑_{s′∈S} pO(s′|s, o) u*O(s′) − u*O(s) ) },   (2)

where u*O is the optimal bias and Os is the set of options that can be started in s (i.e., o ∈ Os ⇔ so = s). In the following sections, we drop the dependency on the option set O from all previous terms whenever clear from the context. Given the optimal average reward ρ*O, we evaluate the performance of a learning algorithm A by its cumulative (SMDP) regret over n decision steps as Δ(A, n) = ( ∑_{i=1}^n τi ) ρ*O − ∑_{i=1}^n Ri. 
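The gain in Eq. 1 can be checked numerically on a toy example: for a fixed policy over options, the per-primitive-step gain is the ratio of expected option reward to expected option duration under the stationary distribution of the embedded state chain. A minimal sketch (the chain, rewards, and durations below are illustrative, not taken from the paper):

```python
# Sketch (illustrative numbers): gain of a fixed policy pi over options via the
# renewal-reward ratio
#   rho_pi = sum_s mu(s) R(s, pi(s)) / sum_s mu(s) tau(s, pi(s)),
# where mu is the stationary distribution of the embedded chain p_O(s'|s, pi(s)).

def stationary(P, iters=10_000):
    """Power iteration for the stationary distribution of a stochastic matrix."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

P   = [[0.2, 0.8], [0.6, 0.4]]   # embedded SMDP chain p_O(s'|s, pi(s))
R   = [1.0, 0.5]                 # expected cumulative reward of the chosen option
tau = [2.0, 4.0]                 # expected duration of the chosen option

mu = stationary(P)
rho = sum(m * r for m, r in zip(mu, R)) / sum(m * t for m, t in zip(mu, tau))
print(rho)  # per-primitive-step gain of pi
```

The same ratio structure is what the data-transformation equation (Eq. 2) optimizes over options, one state at a time.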
In [14] it is shown that Δ(A, n) is equal to the MDP regret up to a linear "approximation" regret accounting for the difference between the optimal gains of M on primitive actions and the associated SMDP MO.

3 Parameter-free SUCRL for Learning with Options

Optimism in SUCRL. At each episode, SUCRL runs a variant of extended value iteration (EVI) [17] to solve the "optimistic" version of the data-transformation optimality equation in Eq. 2, i.e.,

ρ̃* = max_{o∈Os} { max_{R̃,τ̃} { R̃(s, o)/τ̃(s, o) + (1/τ̃(s, o)) ( max_{p̃} { ∑_{s′∈S} p̃(s′|s, o) ũ*(s′) } − ũ*(s) ) } },   (3)

where R̃ and τ̃ are the vectors of cumulative rewards and durations for all state-option pairs and they belong to confidence intervals constructed using parameters (σr(o), br(o)) and (στ(o), bτ(o)) (see Sect. 3 in [14] for the exact expression). Similarly, confidence intervals need to be computed for p̃, but this does not require any prior knowledge on the SMDP since the transition probabilities naturally belong to the simplex over states. As a result, without any prior knowledge, such confidence intervals cannot be directly constructed and SUCRL cannot be run. In the following, we see how constructing an irreducible Markov chain (MC) associated to each option avoids this problem.

3.1 Irreducible Markov Chains Associated to Options

Options as absorbing Markov chains. A natural way to address SUCRL's limitations is to avoid considering options as atomic operations (as in SMDPs) but take into consideration their inner (MDP) structure. 
Since options terminate in finite time (Asm. 1), they can be seen as an absorbing Markov reward process whose state space contains all states that are reachable by the option and where option terminal states are absorbing states of the MC (see Fig. 1). More formally, for any option o the set of inner states So includes the initial state so and all states s with βo(s) < 1 that are reachable by executing πo from so (e.g., So = {s0, s1} in Fig. 1), while the set of absorbing states S^abs_o includes all states with βo(s) > 0 (e.g., S^abs_o = {s0, s1, s2} in Fig. 1). The absorbing MC associated to o is characterized by a transition matrix Po of dimension (|So| + |S^abs_o|) × (|So| + |S^abs_o|) defined as³

Po = [ Qo  Vo ; 0  I ]   with   Qo(s, s′) = (1 − βo(s′)) p(s′|s, πo(s)) for any s, s′ ∈ So,
                                Vo(s, s′) = βo(s′) p(s′|s, πo(s)) for any s ∈ So, s′ ∈ S^abs_o,

where Qo is the transition matrix between inner states (dim. |So| × |So|), Vo is the transition matrix from inner states to absorbing states (dim. |So| × |S^abs_o|), and I is the identity matrix (dim. |S^abs_o| × |S^abs_o|). As proved in Lem. 3 in [14], the expected cumulative reward R(s, o), the duration τ(s, o), and the sub-Exponential parameters (σr(o), br(o)) and (στ(o), bτ(o)) are directly related to the transition matrices Qo and Vo of the associated absorbing chain Po. This suggests that, given an estimate of Po, we could directly derive the corresponding estimates of R(s, o) and τ(s, o). Following this idea, we could "propagate" confidence intervals on the entries of Po to obtain confidence intervals on reward and duration estimates without any prior knowledge on their parameters and thus solve Eq. 3 without any prior knowledge. Nonetheless, intervals on Po do not necessarily translate into compact bounds for R and τ. For example, if the value Ṽo = 0 belongs to the confidence interval of P̃o (no state in S^abs_o can be reached), the corresponding optimistic estimates R̃(s, o) and τ̃(s, o) are unbounded and Eq. 3 is ill-defined.

Figure 1: (upper-left) MDP with an option o starting from s0 and executing a0 in all states with termination probabilities βo(s0) = β0, βo(s1) = β1 and βo(s2) = 1. (upper-right) SMDP dynamics associated to option o. (lower-left) Absorbing MC associated to option o. (lower-right) Irreducible MC obtained by transforming the associated absorbing MC with p′ = (1 − β0)(1 − p) + β0(1 − p) + pβ1 and p′′ = β1(1 − p) + p.

³In the following we only focus on the dynamics of the process; similar definitions apply for the rewards.

Options as irreducible Markov chains. We first notice from Eq. 
2 that computing the optimal policy only requires computing the ratio R(s, o)/τ(s, o) and the inverse 1/τ(s, o). Starting from Po, we can construct an irreducible MC whose stationary distribution is directly related to these terms. We proceed as illustrated in Fig. 1: all terminal states are "merged" together and their transitions are "redirected" to the initial state so. More formally, let 1 be the all-one vector of dimension |S^abs_o|; then vo = Vo 1 ∈ R^{|So|} contains the cumulative probability to transition from an inner state to any terminal state. Then the chain Po can be transformed into a MC with transition matrix P′o = [vo Q′o] ∈ R^{So×So}, where Q′o contains all but the first column of Qo. P′o is now an irreducible MC, as any state can be reached starting from any other state, and thus it admits a unique stationary distribution μo. In order to relate μo to the optimality equation in Eq. 2, we need an additional assumption on the options.

Assumption 2. For any option o ∈ O, the starting state so is also a terminal state (i.e., βo(so) = 1) and any state s′ ∈ S with βo(s′) < 1 is an inner state (i.e., s′ ∈ So).

Input: Confidence δ ∈ ]0, 1[, rmax, S, A, O
For episodes k = 1, 2, ... do
1. Set ik := i, t = tk and episode counters νk(s, a) = 0, νk(s, o) = 0
2. Compute estimates p̂k(s′|s, o), P̂′o,k, r̂k(s, a) and their confidence intervals in Eq. 6
3. Compute an εk-approximation of the optimal optimistic policy π̃k of Eq. 5
4. While ∀l ∈ [t + 1, t + τi], νk(sl, al) < Nk(sl, al) do
   (a) Execute option oi = π̃k(si), obtain primitive rewards r_i^1, ..., r_i^{τi} and visited states s_i^1, ..., s_i^{τi} = s_{i+1}
   (b) Set νk(si, oi) += 1, i += 1, t += τi and νk(s, πoi(s)) += 1 for all s ∈ {s_i^1, ..., s_i^{τi}}
5. Set Nk(s, o) += νk(s, o) and Nk(s, a) += νk(s, a)

Figure 2: The general structure of FSUCRL.

While the first part has a very minor impact on the definition of O, the second part of the assumption guarantees that options are "well designed" as it requires the termination condition to be coherent with the true inner states of the option, so that if βo(s′) < 1 then s′ should be indeed reachable by the option. Further discussion about Asm. 2 is reported in App. A. We then obtain the following property.

Lemma 1. Under Asm. 2, let μo ∈ [0, 1]^{So} be the unique stationary distribution of the irreducible MC P′o associated to option o; then⁴

∀s ∈ S, ∀o ∈ Os:   1/τ(s, o) = μo(s)   and   R(s, o)/τ(s, o) = ∑_{s′∈So} r(s′, πo(s′)) μo(s′).   (4)

This lemma illustrates the relationship between the stationary distribution of P′o and the key terms in Eq. 2.⁵ As a result, we can apply Lem. 1 to Eq. 3 and obtain the optimistic optimality equation

∀s ∈ S:   ρ̃* = max_{o∈Os} { max_{μ̃o,r̃o} { ∑_{s′∈So} r̃o(s′) μ̃o(s′) + μ̃o(s) ( max_{b̃o} { b̃o⊺ ũ* } − ũ*(s) ) } },   (5)

where r̃o(s′) = r̃(s′, πo(s′)) and b̃o = (p̃(s′|s, o))_{s′∈S}. 
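Lemma 1 can be verified numerically on a toy option. A minimal sketch (transition probabilities and rewards below are illustrative, not from the paper): the expected duration and cumulative reward are computed from the absorbing chain's fundamental matrix, and both identities of Eq. 4 are checked against the stationary distribution of the irreducible chain.

```python
# Numerical check of Lemma 1 (Eq. 4) on a toy option with inner states {s0, s1},
# s0 = so. From s0 the option continues to s1 w.p. 0.5 and terminates w.p. 0.5;
# from s1 it stays w.p. 0.3 and terminates w.p. 0.7. Terminations are redirected
# to s0, giving the irreducible chain P'_o. All numbers are illustrative.

Q  = [[0.0, 0.5], [0.0, 0.3]]   # inner-to-inner transitions of the absorbing chain
Pp = [[0.5, 0.5], [0.7, 0.3]]   # irreducible chain: termination mass sent back to s0
r  = [1.0, 0.5]                 # per-step rewards r(s, pi_o(s))

# Stationary distribution mu_o of P'_o by power iteration.
mu = [0.5, 0.5]
for _ in range(5000):
    mu = [sum(mu[i] * Pp[i][j] for i in range(2)) for j in range(2)]

# Row s0 of the fundamental matrix N = I + Q + Q^2 + ... (Neumann series):
# expected number of visits to each inner state before absorption.
row, total = [1.0, 0.0], [1.0, 0.0]
for _ in range(5000):
    row = [sum(row[i] * Q[i][j] for i in range(2)) for j in range(2)]
    total = [t + x for t, x in zip(total, row)]

tau = sum(total)                               # expected duration tau(s0, o)
R = sum(t * ri for t, ri in zip(total, r))     # expected cumulative reward R(s0, o)

print(abs(1.0 / tau - mu[0]))                              # Eq. 4: 1/tau == mu_o(s0)
print(abs(R / tau - sum(ri * mi for ri, mi in zip(r, mu))))  # Eq. 4: R/tau == sum r*mu
```

Both printed residuals are numerically zero, matching the two identities in Eq. 4.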
Unlike in the absorbing MC case, where compact confidence sets for Po may lead to unbounded optimistic estimates for R̃ and τ̃, in this formulation μo(s) can be equal to 0 (i.e., infinite duration and cumulative reward) without compromising the solution of Eq. 5. Furthermore, estimating μo implicitly leverages the correlation between cumulative reward and duration, which is ignored when estimating R(s, o) and τ(s, o) separately. Finally, we prove the following result.

Lemma 2. Let r̃o ∈ R, b̃o ∈ P, and μ̃o ∈ M, with R, P, M compact sets containing the true parameters ro, bo and μo; then the optimality equation in Eq. 5 always admits a unique solution ρ̃* and ρ̃* ≥ ρ* (i.e., the solution of Eq. 5 is an optimistic gain).

Now, we need to provide an explicit algorithm to compute the optimistic optimal gain ρ̃* of Eq. 5 and its associated optimistic policy. In the next section, we introduce two alternative algorithms that are guaranteed to compute an ε-optimistic policy.

3.2 SUCRL with Irreducible Markov Chains

The structure of the UCRL-like algorithm for learning with options but with no prior knowledge on distribution parameters (called FREE-SUCRL, or FSUCRL) is reported in Fig. 2. Unlike SUCRL, we do not directly estimate the expected cumulative reward and duration of options but we estimate the SMDP transition probabilities p(s′|s, o), the irreducible MC P′o associated to each option, and the state-action reward r(s, a). For all these terms we can compute confidence intervals (Hoeffding and empirical Bernstein) without any prior knowledge as

⁴Notice that since option o is defined in s, then s = so. Furthermore r is the MDP expected reward.
⁵Lem. 4 in App. 
D extends this result by giving an interpretation of μo(s′), ∀s′ ∈ So.

|r(s, a) − r̂k(s, a)| ≤ β^r_k(s, a) ∝ rmax √( log(SAtk/δ) / Nk(s, a) ),   (6a)

|p(s′|s, o) − p̂k(s′|s, o)| ≤ β^p_k(s, o, s′) ∝ √( 2 p̂k(s′|s, o)(1 − p̂k(s′|s, o)) c_{tk,δ} / Nk(s, o) ) + 7 c_{tk,δ} / (3 Nk(s, o)),   (6b)

|P′o(s, s′) − P̂′o,k(s, s′)| ≤ β^P_k(s, o, s′) ∝ √( 2 P̂′o,k(s, s′)(1 − P̂′o,k(s, s′)) d_{tk,δ} / Nk(s, πo(s)) ) + 7 d_{tk,δ} / (3 Nk(s, πo(s))),   (6c)

where Nk(s, a) (resp. Nk(s, o)) is the number of samples collected at state-action (s, a) (resp. state-option (s, o)) up to episode k, Eq. 6a coincides with the one used in UCRL, in Eq. 6b s = so and s′ ∈ S, and in Eq. 6c s, s′ ∈ So. Finally, we set c_{tk,δ} = O(log(SOtk/δ)) and d_{tk,δ} = O(log(|So| log(tk)/δ)) [18, Eq. 31].

To obtain an actual implementation of the algorithm reported in Fig. 2 we need to define a procedure to compute an approximation of Eq. 5 (step 3). 
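The empirical-Bernstein widths of Eqs. 6b-6c can be sketched as a one-line function (constants as stated above; the inputs in the example are illustrative). The point of the Bernstein form is that low-variance (i.e., near-deterministic) transitions get much tighter intervals than the plain Hoeffding rate √(c/N):

```python
# Sketch of the empirical-Bernstein deviation width in Eqs. 6b-6c:
#   beta = sqrt(2 * p_hat * (1 - p_hat) * c / N) + 7 * c / (3 * N),
# for an estimated Bernoulli mean p_hat built from N samples and a
# log-confidence term c. Example inputs are illustrative.
import math

def bernstein_width(p_hat, n_samples, c):
    """Empirical-Bernstein bound: variance term plus a 1/N correction."""
    return (math.sqrt(2.0 * p_hat * (1.0 - p_hat) * c / n_samples)
            + 7.0 * c / (3.0 * n_samples))

print(bernstein_width(0.05, 1000, 3.0))  # rare transition: small variance, tight interval
print(bernstein_width(0.50, 1000, 3.0))  # worst-case variance: widest interval
```

This is why no prior knowledge on option parameters is needed here: the widths depend only on observed counts and empirical frequencies.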
Similar to UCRL and SUCRL, we define an EVI algorithm starting from a function u0(s) = 0 and computing at each iteration j

uj+1(s) = max_{o∈Os} { max_{μ̃o} { ∑_{s′∈So} r̃o(s′) μ̃o(s′) + μ̃o(s) ( max_{b̃o} { b̃o⊺ uj } − uj(s) ) } } + uj(s),   (7)

where r̃o(s′) is the optimistic reward (i.e., estimate plus the confidence bound of Eq. 6a) and the optimistic transition probability vector b̃o is computed using the algorithm introduced in [19, App. A] for the Bernstein bounds of Eqs. 6b-6c, or in [15, Fig. 2] for the Hoeffding bound (see App. B). Depending on whether confidence intervals for μo are computed explicitly or implicitly, we can define two alternative implementations that we present below.

Explicit confidence intervals. Given the estimate P̂′o, let μ̂o be the solution of μ̂o⊺ P̂′o = μ̂o⊺ under the constraint μ̂o⊺ e = 1. Such a μ̂o always exists and is unique since P̂′o is computed after terminating the option at least once and is thus irreducible. The perturbation analysis in [20] can be applied to derive the confidence interval

‖μo − μ̂o‖1 ≤ β^μ_k(o) := κ̂o,min ‖P′o − P̂′o‖∞,1,   (8)

where ‖·‖∞,1 is the maximum of the ℓ1-norm of the rows of the transition matrix and κ̂o,min is the smallest condition number⁶ for the ℓ1-norm of μo. Let ζo ∈ R^{|So|} be such that ζo(so) = r̃o(so) + max_{b̃o} { b̃o⊺ uj } − uj(so) and ζo(s) = r̃o(s) otherwise; then the maximum over μ̃o in Eq. 7 has the same form as the innermost maximum over bo (with Hoeffding bound) and thus we can directly apply Alg. [15, Fig. 2] with parameters μ̂o, β^μ_k(o), and states So ordered descendingly according to ζo. The resulting value is then directly plugged into Eq. 7 and uj+1 is computed. 
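The ℓ1-constrained maximization applied here can be sketched as follows. This is a minimal illustration in the spirit of [15, Fig. 2] (the function name and the example inputs are ours, not the paper's pseudocode): given a center distribution, an ℓ1 radius, and per-state values, it shifts probability mass from the lowest-value states to the highest-value one.

```python
# Sketch (in the spirit of [15, Fig. 2]) of
#   argmax_{mu in simplex, ||mu - mu_hat||_1 <= beta} sum_s zeta(s) * mu(s).
# Add beta/2 mass to the best state, then remove the excess from the worst
# states. Names and inputs are illustrative.

def optimistic_dist(mu_hat, beta, zeta):
    order = sorted(range(len(zeta)), key=lambda s: zeta[s], reverse=True)
    mu = list(mu_hat)
    best = order[0]
    mu[best] = min(1.0, mu_hat[best] + beta / 2.0)  # extra mass on the best state
    excess = sum(mu) - 1.0                          # renormalize from the worst states
    for s in reversed(order):
        if excess <= 0:
            break
        removed = min(mu[s], excess) if s != best else 0.0
        mu[s] -= removed
        excess -= removed
    return mu

mu = optimistic_dist([0.4, 0.4, 0.2], beta=0.2, zeta=[1.0, 0.5, 0.1])
print(mu)  # mass moved from the lowest-zeta state to the highest-zeta one
```

The returned vector stays on the simplex and within the ℓ1 ball, which is exactly what plugging the optimistic μ̃o back into Eq. 7 requires.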
We refer to this algorithm as FSUCRLV1.

Nested extended value iteration. An alternative approach builds on the observation that the maximum over μ̃o in Eq. 7 can be seen as the optimization of the average reward (gain)

ρ̃*_o(uj) = max_{μ̃o} { ∑_{s′∈So} ζo(s′) μ̃o(s′) },   (9)

where ζo is defined as above. Eq. 9 is indeed the optimal gain of a bounded-parameter MDP with state space So, an action space composed of the option action (i.e., πo(s)), and transitions P̃′o in the confidence intervals⁷ of Eq. 6c, and thus we can write its optimality equation

ρ̃*_o(uj) = max_{P̃′o} { ζo(s) + ∑_{s′} P̃′o(s, s′) w̃*_o(s′) } − w̃*_o(s),   (10)

⁶The provably smallest condition number (refer to [21, Th. 2.3]) is the one provided by Seneta [22]: κ̂o,min = τ1(Ẑo) = max_{i,j} (1/2) ‖Ẑo(i, :) − Ẑo(j, :)‖1, where Ẑo(i, :) is the i-th row of Ẑo = (I − P̂′o + 1 μ̂o⊺)⁻¹.
⁷The confidence intervals on P̃′o can never exclude a non-zero transition between any two states of So. Therefore, the corresponding bounded-parameter MDP is always communicating and ρ̃*_o(uj) is state-independent.

where w̃*_o is an optimal bias. For any input function v we can compute ρ̃*_o(v) by using EVI on the bounded-parameter MDP, thus avoiding to explicitly construct the confidence intervals of μ̃o. As a result, we obtain two nested EVI algorithms where, starting from an initial bias function v0(s) = 0,⁸ at any iteration j we set the bias function of the inner EVI to w^o_{j,0}(s) = 0 and we compute (see App. C.3 for the general EVI for bounded-parameter MDPs and its guarantees)

w^o_{j,l+1}(s′) = max_{P̃o} { ζo(s′) + P̃o(·|s′)⊺ w^o_{j,l} },   (11)

until the stopping condition l^o_j = inf{ l ≥ 0 : sp{w^o_{j,l+1} − w^o_{j,l}} ≤ εj } is met, where (εj)_{j≥0} is a vanishing sequence. As w^o_{j,l+1} − w^o_{j,l} converges to ρ̃*_o(vj) with l, the outer EVI becomes

vj+1(s) = max_{o∈Os} { g( w^o_{j,l^o_j+1} − w^o_{j,l^o_j} ) } + vj(s),   (12)

where g : v ↦ (1/2)(max{v} + min{v}). In App. C.4 we show that this nested scheme, which we call FSUCRLV2, converges to the solution of Eq. 5. Furthermore, if the algorithm is stopped when sp{vj+1 − vj} + εj ≤ ε then |ρ̃* − g(vj+1 − vj)| ≤ ε/2.

One of the interesting features of this algorithm is its hierarchical structure. Nested EVI operates on two different time scales by iteratively considering every option as an independent optimistic planning sub-problem (EVI of Eq. 11) and gathering all the results into a higher-level planning problem (EVI of Eq. 12). This idea is at the core of the hierarchical approach in RL, but it is not always present in the algorithmic structure, while nested EVI naturally arises from decomposing Eq. 7 into two value iteration algorithms. It is also worth underlining that the confidence intervals implicitly generated for μ̃o are never worse than those in Eq. 8 and they are often much tighter. In practice the bound of Eq. 
8 may be actually worse because of the worst-case scenario considered in the computation of the condition numbers (see Sec. 5 and App. F).

4 Theoretical Analysis

Before stating the guarantees for FSUCRL, we recall the definition of diameter of M and MO:

D = max_{s,s′∈S} min_{π:S→A} E[ τπ(s, s′) ],   DO = max_{s,s′∈SO} min_{π:S→O} E[ τπ(s, s′) ],

where τπ(s, s′) is the (random) number of primitive actions to move from s to s′ following policy π. We also define a pseudo-diameter characterizing the "complexity" of the inner dynamics of options:

D̃O = ( r* κ1* + τmax κ∞* ) / √μ*,

where we define

r* = max_{o∈O} {sp(ro)},   κ1* = max_{o∈O} {κ1_o},   κ∞* = max_{o∈O} {κ∞_o},   and   μ* = min_{o∈O} min_{s∈So} μo(s),

with κ1_o and κ∞_o the condition numbers of the irreducible MC associated to option o (for the ℓ1- and ℓ∞-norm respectively [20]) and sp(ro) the span of the reward of the option. In App. D we prove the following regret bound.

Theorem 1. Let M be a communicating MDP with reward bounded between 0 and rmax = 1 and let O be a set of options satisfying Asm. 1 and 2 such that σr(s, o) ≤ σr, στ(s, o) ≤ στ, and τ(s, o) ≤ τmax. We also define BO = max_{s,o} supp(p(·|s, o)) (resp. B = max_{s,a} supp(p(·|s, a))) as the largest support of the SMDP (resp. MDP) dynamics. 
Let T_n be the number of primitive steps executed when running FSUCRLV2 over n decision steps; then its regret is bounded as

\[
\Delta(\text{FSUCRL}, n) = \widetilde{O}\Big(
\underbrace{D_O \sqrt{S B_O O\, n}}_{\Delta_p}
+ \underbrace{(\sigma_r + \sigma_\tau)\sqrt{S A\, n}}_{\Delta_{R,\tau}}
+ \underbrace{\widetilde{D}_O \sqrt{S B O\, T_n}}_{\Delta_\mu}
\Big).
\tag{13}
\]

^8 We use v_j instead of u_j since the error in the inner EVI directly affects the value of the function at the outer EVI, which thus generates a sequence of functions different from (u_j).

Comparison to SUCRL. Using the confidence intervals of Eq. 6b and a slightly tighter analysis than the one by Fruit and Lazaric [14] (Bernstein bounds and higher accuracy for EVI) leads to a regret bound for SUCRL of

\[
\Delta(\text{SUCRL}, n) = \widetilde{O}\Big( \Delta_p + \Delta_{R,\tau} + \underbrace{(\sigma^+_r + \sigma^+_\tau)\sqrt{S A\, n}}_{\Delta'_{R,\tau}} \Big),
\tag{14}
\]

where σ+_r and σ+_τ are the upper bounds on σ_r and σ_τ that are used in defining the confidence intervals for τ and R actually used in SUCRL. The term Δ_p is the regret induced by errors in estimating the SMDP dynamics p(s'|s, o), while Δ_{R,τ} summarizes the randomness in the cumulative reward and duration of options. Both these terms scale as √n, thus taking advantage of temporal abstraction (i.e., the ratio between the number of primitive steps T_n and the number of decision steps n). The main difference between the two bounds is then in the last term, which accounts for the regret due to the optimistic estimation of the behavior of the options. In SUCRL this regret is linked to the upper bounds on the parameters of R and τ.
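As a side note, the quantity μ_* entering the pseudo-diameter D̃_O (and hence Δ_μ) is directly computable from an option's inner chain. Below is a minimal numerical sketch; the 3-state chain is hypothetical (not from the paper), and the condition numbers κ are omitted since they come from perturbation bounds as in [20].

```python
# Sketch: computing the stationary distribution mu_o of an option's irreducible
# inner MC and the quantity mu_* = min_s mu_o(s) that enters D~_O.
# The chain below is a hypothetical 3-state option, chosen only for illustration.
import numpy as np

def stationary_distribution(P):
    """Stationary distribution mu of an irreducible Markov chain P (mu P = mu)."""
    n = P.shape[0]
    # Solve mu (P - I) = 0 together with sum(mu) = 1 as a linear least-squares system.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

# Hypothetical inner chain after the irreducible-MC transformation:
# the terminal state (index 2) loops back to the initial state (index 0).
P_o = np.array([[0.0, 1.0, 0.0],
                [0.0, 0.2, 0.8],
                [1.0, 0.0, 0.0]])

mu_o = stationary_distribution(P_o)   # approximately [4/13, 5/13, 4/13]
mu_star = mu_o.min()                  # the mu_* appearing in D~_O
print(mu_o, 1.0 / mu_star)
```

For this chain 1/μ_* = 3.25 ≥ |S_o| = 3, in line with the relation 1/μ_o(s) ≥ τ_o(s) ≥ |S_o| discussed below: the less an inner state is visited, the harder the option is to estimate.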
As shown in Thm. 2 of [14], when σ+_r = σ_r and σ+_τ = σ_τ, the bound of SUCRL is nearly optimal as it almost matches the lower bound, thus showing that Δ'_{R,τ} is unavoidable. In FSUCRL, however, the additional regret Δ_μ comes from the estimation errors of the per-time-step rewards r_o and the dynamics P'_o. Similarly to Δ_p, these errors are amplified by the pseudo-diameter D̃_O. While Δ_μ may actually be the unavoidable cost to pay for removing the prior knowledge about options, it is interesting to analyze how D̃_O changes with the structure of the options (see App. E for a concrete example). The probability μ_o(s) decreases with the probability of visiting an inner state s ∈ S_o using the option policy. In this case, the probability of collecting samples on the inner transitions is low, which leads to large estimation errors for P'_o. These errors are then propagated to the stationary distribution μ_o through the condition numbers κ (e.g., κ¹_o directly follows from a non-empirical version of Eq. 8). Furthermore, we notice that 1/μ_o(s) ≥ τ_o(s) ≥ |S_o|, suggesting that "long" or "big" options are indeed more difficult to estimate. On the other hand, Δ_μ becomes smaller whenever the transition probabilities under policy π_o are supported over a few states (B small) and the rewards are similar within the option (sp(r_o) small). While in the worst case Δ_μ may actually be much bigger than Δ'_{R,τ} when the parameters of R and τ are accurately known (i.e., σ+_τ ≈ σ_τ and σ+_r ≈ σ_r), in Sect. 5 we show scenarios in which the actual performance of FSUCRL is close to or better than SUCRL's and the advantage of learning with options is preserved.
To explain why FSUCRL can perform better than SUCRL, we point out that FSUCRL's bound is somewhat worst-case w.r.t. the correlation between options. In fact, in Eq. 6c the error in estimating P'_o in a state s does not scale with the number of samples obtained while executing option o but with those collected by taking the primitive action prescribed by π_o. This means that even if o has a low probability of reaching s starting from s_o (i.e., μ_o(s) is very small), the true error may still be small as soon as another option o' executes the same action (i.e., π_o(s) = π_{o'}(s)). In this case the regret bound is loose and the actual performance of FSUCRL is much better. Therefore, although it is not apparent in the regret analysis, FSUCRL not only leverages the correlation between the cumulative reward and duration of a single option, but also leverages the correlation between different options that share inner state-action pairs.
Comparison to UCRL. We recall that the regret of UCRL is bounded as O(D√(S B A T_n)), where T_n is the total number of steps. As discussed in [14], the major advantage of options lies in temporal abstraction (i.e., T_n ≫ n) and in the reduction of the state-action space (i.e., S_O < S and O < A). Eq. (13) also reveals that options can improve the learning speed by reducing the size B_O of the support of the environment dynamics w.r.t. primitive actions. This can lead to a huge improvement, e.g., when options are designed so as to reach a specific goal.
This potential advantage is new compared to [14] and matches the intuition about "good" options often presented in the literature (see, e.g., the concept of "funnel" actions introduced by Dietterich [23]).

Bound for FSUCRLV1. Bounding the regret of FSUCRLV1 requires bounding the empirical κ̂ in Eq. (8) with the true condition number κ. Since κ̂ tends to κ as the number of samples of the option increases, the overall regret would only be increased by a lower-order term. In practice, however, FSUCRLV2 is preferable to FSUCRLV1. The latter suffers from the true condition numbers (κ¹_o)_{o∈O}, since they are used to compute the confidence bounds on the stationary distributions (μ_o)_{o∈O}, while for FSUCRLV2 they appear only in the analysis. Much like the dependency on the diameter in the analysis of UCRL, the condition numbers may be loose in practice, although tight from a theoretical perspective. See App. D.6 and the experiments for further insights.

Figure 3: (Left) Regret after 1.2 · 10^8 steps normalized w.r.t. UCRL for different option durations in a 20x20 grid-world. (Right) Evolution of the regret as T_n increases for a 14x14 four-rooms maze.

5 Numerical Simulations

In this section we compare the regret of FSUCRL to SUCRL and UCRL to empirically verify the impact of removing prior knowledge about options and of estimating their structure through the irreducible MC transformation. We consider the toy domain presented in [14], which was specifically designed to show the advantage of temporal abstraction, and the classical four-rooms maze [1]. To be able to reproduce the results of [14], we run our algorithm with Hoeffding confidence bounds for the ℓ1-deviation of the empirical distribution (implying that B_O has no impact). We consider settings where Δ_{R,τ} is the dominating term of the regret (refer to App.
F for details).
When comparing the two versions of FSUCRL to UCRL on the grid domain (see Fig. 3 (left)), we empirically observe that the advantage of temporal abstraction is indeed preserved when the knowledge of the option parameters is removed. This shows that the benefit of temporal abstraction is not just a mere artifact of prior knowledge about the options. Although the theoretical bound in Thm. 1 is always worse than its SMDP counterpart (14), we see that FSUCRL performs much better than SUCRL in our examples. This can be explained by the fact that the options we use greatly overlap. Even if our regret bound does not make explicit the fact that FSUCRL exploits the correlation between options, this can significantly impact the results in practice. The two versions of SUCRL differ in the amount of prior knowledge given to the algorithm to construct the parameters σ+_r and σ+_τ that are used in building the confidence intervals. In v3 we provide a tight upper bound r_max on the rewards and distinct option-dependent parameters for the duration (τ_o and σ_τ(o)); in v2 we only provide a global (option-independent) upper bound on τ_o and σ_o. Unlike FSUCRL, which is "parameter-free", SUCRL is highly sensitive to the prior knowledge about options and can perform even worse than UCRL. A similar behaviour is observed in Fig. 3 (right), where both versions of SUCRL fail to beat UCRL but FSUCRLV2 has nearly half the regret of UCRL. On the contrary, FSUCRLV1 suffers a linear regret due to a loose dependency on the condition numbers (see App. F.2). This shows that the condition numbers appearing in the bound of FSUCRLV2 are actually loose.
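To make the overlap effect concrete, here is a toy sketch (a hypothetical deterministic 5-state corridor, unrelated to the paper's actual domains) of how two options sharing inner state-action pairs pool their transition samples:

```python
# Toy illustration: two options whose policies prescribe the same primitive
# action on shared inner states. FSUCRL estimates the inner dynamics from
# samples of (state, primitive action), so executing either option refines
# the estimates used by both.
from collections import Counter

RIGHT = "right"
# inner states -> primitive action for two overlapping options on a corridor
pi = {"o1": {0: RIGHT, 1: RIGHT, 2: RIGHT},   # o1: starts at 0, terminates at 3
      "o2": {1: RIGHT, 2: RIGHT, 3: RIGHT}}   # o2: starts at 1, terminates at 4

samples = Counter()
for option, start in [("o1", 0), ("o2", 1), ("o1", 0)]:
    s = start
    while s in pi[option]:              # deterministic corridor: "right" moves s -> s+1
        samples[(s, pi[option][s])] += 1
        s += 1

shared = set(pi["o1"].items()) & set(pi["o2"].items())
print(sorted(shared))                   # inner state-action pairs shared by o1 and o2
print(samples[(1, RIGHT)])              # both options contributed samples here
```

States 1 and 2 receive samples from every execution of either option, so their transition estimates improve faster than a per-option analysis (like the worst-case bound) would suggest.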
In both experiments, UCRL and FSUCRL had similar running times, meaning that the improvement in cumulative regret does not come at the expense of computational complexity.

6 Conclusions

We introduced FSUCRL, a parameter-free algorithm for learning in MDPs with options that combines the SMDP view, to estimate the transition probabilities at the level of options (p(s'|s, o)), with the MDP structure of options, to estimate the stationary distribution of an associated irreducible MC, which in turn allows computing the optimistic policy at each episode. The resulting regret matches SUCRL's bound up to an additive term. While in general this additional regret may be large, we show both theoretically and empirically that FSUCRL is actually competitive with SUCRL and retains the advantage of temporal abstraction w.r.t. learning without options. Since FSUCRL does not require strong prior knowledge about options and its regret bound is partially computable, we believe the results of this paper could be used as a basis to construct more principled option discovery algorithms that explicitly optimize the exploration-exploitation performance of the learning algorithm.

Acknowledgments

This research was supported in part by the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council, and the French National Research Agency (ANR) under project ExTra-Learn (n. ANR-14-CE24-0010-01).

References

[1] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[2] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density.
In Proceedings of the Eighteenth International Conference on Machine Learning, pages 361–368, 2001.

[3] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut: dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the 13th European Conference on Machine Learning, Helsinki, Finland, August 19–23, 2002, pages 295–306. Springer Berlin Heidelberg, 2002.

[4] Özgür Şimşek and Andrew G. Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, 2004.

[5] Pablo Samuel Castro and Doina Precup. Automatic construction of temporally extended actions for MDPs using bisimulation metrics. In Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning, EWRL'11, pages 140–152, Berlin, Heidelberg, 2012. Springer-Verlag.

[6] Kfir Y. Levy and Nahum Shimkin. Unified inter and intra options learning using policy gradient methods. In EWRL, volume 7188 of Lecture Notes in Computer Science, pages 153–164. Springer, 2011.

[7] Munu Sairamesh and Balaraman Ravindran. Options with exceptions. In Proceedings of the 9th European Conference on Recent Advances in Reinforcement Learning, EWRL'11, pages 165–176, Berlin, Heidelberg, 2012. Springer-Verlag.

[8] Timothy Arthur Mann, Daniel J. Mankowitz, and Shie Mannor. Time-regularized interrupting options (TRIO). In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1350–1358. JMLR.org, 2014.

[9] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pages 1553–1561. AAAI Press, 2017.

[10] Martin Stolle and Doina Precup. Learning options in reinforcement learning. In SARA, volume 2371 of Lecture Notes in Computer Science, pages 212–223. Springer, 2002.

[11] Timothy A. Mann and Shie Mannor. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 127–135. JMLR.org, 2014.

[12] Nicholas K. Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in reinforcement learning. In The Seventh International Joint Conference on Autonomous Agents and Multiagent Systems, May 2008.

[13] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, volume 32 of JMLR Proceedings, pages 316–324. JMLR.org, 2014.

[14] Ronan Fruit and Alessandro Lazaric. Exploration–exploitation in MDPs with options. In Proceedings of Machine Learning Research, volume 54: Artificial Intelligence and Statistics, 20–22 April 2017, Fort Lauderdale, FL, USA, pages 576–584, 2017.

[15] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[16] A. Federgruen, P.J. Schweitzer, and H.C. Tijms. Denumerable undiscounted semi-Markov decision processes with unbounded rewards. Mathematics of Operations Research, 8(2):298–313, 1983.

[17] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes.
Journal of Computer and System Sciences, 74(8):1309–1331, December 2008.

[18] Daniel J. Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 15, pages 1459–1467. MIT Press, 2015.

[19] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 15, pages 2818–2826. MIT Press, 2015.

[20] Grace E. Cho and Carl D. Meyer. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1):137–150, 2001.

[21] Stephen J. Kirkland, Michael Neumann, and Nung-Sing Sze. On optimal condition numbers for Markov chains. Numerische Mathematik, 110(4):521–537, October 2008.

[22] E. Seneta. Sensitivity of finite Markov chains under perturbation. Statistics & Probability Letters, 17(2):163–168, May 1993.

[23] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[24] Ronald Ortner. Optimism in the face of uncertainty should be refutable. Minds and Machines, 18(4):521–526, 2008.

[25] Pierre Bremaud. Applied Probability Models with Optimization Applications, chapter 3: Recurrence and Ergodicity. Springer-Verlag Inc, Berlin; New York, 1999.

[26] Pierre Bremaud. Applied Probability Models with Optimization Applications, chapter 2: Discrete-Time Markov Models. Springer-Verlag Inc, Berlin; New York, 1999.

[27] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.

[28] Peter L. Bartlett and Ambuj Tewari.
REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 35–42. AUAI Press, 2009.

[29] Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

[30] Martin Wainwright. Course on Mathematical Statistics, chapter 2: Basic tail and concentration bounds. University of California at Berkeley, Department of Statistics, 2015.