{"title": "Selecting the State-Representation in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2627, "page_last": 2635, "abstract": "The problem of selecting the right state-representation in a reinforcement learning problem is considered.   Several models (functions mapping past observations to a finite set) of the observations are given, and it is known   that for at least one of these models the resulting state dynamics are indeed Markovian. Without  knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting   MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that,   with a regret of order T^{2/3} where T is the horizon time.", "full_text": "Selecting the State-Representation\n\nin Reinforcement Learning\n\nOdalric-Ambrym Maillard\nINRIA Lille - Nord Europe\n\nodalricambrym.maillard@gmail.com\n\nR\u00b4emi Munos\n\nINRIA Lille - Nord Europe\nremi.munos@inria.fr\n\nDaniil Ryabko\n\nINRIA Lille - Nord Europe\ndaniil@ryabko.net\n\nAbstract\n\nThe problem of selecting the right state-representation in a reinforcement learning\nproblem is considered. Several models (functions mapping past observations to\na \ufb01nite set) of the observations are given, and it is known that for at least one of\nthese models the resulting state dynamics are indeed Markovian. Without know-\ning neither which of the models is the correct one, nor what are the probabilistic\ncharacteristics of the resulting MDP, it is required to obtain as much reward as the\noptimal policy for the correct model (or for the best of the correct models, if there\nare several). 
We propose an algorithm that achieves that, with a regret of order\nT 2/3 where T is the horizon time.\n\n1\n\nIntroduction\n\nWe consider the problem of selecting the right state-representation in an average-reward reinforce-\nment learning problem. Each state-representation is de\ufb01ned by a model \u03c6j (to which corresponds a\nstate space S\u03c6j ) and we assume that the number J of available models is \ufb01nite and that (at least) one\nmodel is a weakly-communicating Markov decision process (MDP). We do not make any assump-\ntion at all about the other models. This problem is considered in the general reinforcement learning\nsetting, where an agent interacts with an unknown environment in a single stream of repeated ob-\nservations, actions and rewards. There are no \u201cresets,\u201d thus all the learning has to be done online.\nOur goal is to construct an algorithm that performs almost as well as the algorithm that knows both\nwhich model is a MDP (knows the \u201ctrue\u201d model) and the characteristics of this MDP (the transition\nprobabilities and rewards).\nConsider some examples that help motivate the problem. The \ufb01rst example is high-level feature\nselection. Suppose that the space of histories is huge, such as the space of video streams or that of\ngame plays. In addition to these data, we also have some high-level features extracted from it, such\nas \u201cthere is a person present in the video\u201d or \u201cthe adversary (in a game) is aggressive.\u201d We know that\nmost of the features are redundant, but we also know that some combination of some of the features\ndescribes the problem well and exhibits Markovian dynamics. Given a potentially large number\nof feature combinations of this kind, we want to \ufb01nd a policy whose average reward is as good as\nthat of the best policy for the right combination of features. Another example is bounding the order\nof an MDP. 
The process is known to be k-order Markov, where k is unknown but an upper bound\nK >> k is given. The goal is to perform as well as if we knew k. Yet another example is selecting\nthe right discretization. The environment is an MDP with a continuous state space. We have several\ncandidate quantizations of the state space, one of which gives an MDP. Again, we would like to find\na policy that is as good as the optimal policy for the right discretization. This example also opens\nthe way for extensions of the proposed approach: we would like to be able to treat an infinite set\nof possible discretizations, none of which may be perfectly Markovian. The present work can be\nconsidered the first step in this direction.\nIt is important to note that we do not make any assumptions on the \u201cwrong\u201d models (those that do\nnot have Markovian dynamics). Therefore, we are not able to test which model is Markovian in the\nclassical statistical sense, since in order to do that we would need a viable alternative hypothesis\n(such as, the model is not Markov but is K-order Markov). In fact, the constructed algorithm never\n\u201cknows\u201d which model is the right one; it is \u201conly\u201d able to get the same average level of reward as if\nit knew.\nPrevious work. This work builds on previous work on learning average-reward MDPs. Namely,\nour algorithm uses as a subroutine the UCRL2 algorithm of [6], which is designed to provide\nfinite time bounds for undiscounted MDPs. Such a problem has been pioneered in the reinforcement\nlearning literature by [7] and then improved in various ways by [4, 11, 12, 6, 3]; UCRL2 achieves a\nregret of the order $D T^{1/2}$ in any weakly-communicating MDP with diameter D, with respect to the\nbest policy for this MDP. The diameter D of an MDP is defined in [6] as the expected minimum time\nrequired to reach any state starting from any other state. 
A related result is reported in [3], which\nimproves on constants related to the characteristics of the MDP.\nA similar approach has been considered in [10]; the difference is that in that work the probabilistic\ncharacteristics of each model are completely known, but the models are not assumed to be Marko-\nvian, and belong to a countably in\ufb01nite (rather than \ufb01nite) set.\nThe problem we address can be also viewed as a generalization of the bandit problem (see e.g. [9,\n8, 1]): there are \ufb01nitely many \u201carms\u201d, corresponding to the policies used in each model, and one\nof the arms is the best, in the sense that the corresponding model is the \u201ctrue\u201d one. In the usual\nbandit setting, the rewards are assumed to be i.i.d. thus one can estimate the mean value of the arms\nwhile switching arbitrarily from one arm to the next (the quality of the estimate only depends on the\nnumber of pulls of each arm). However, in our setting, estimating the average-reward of a policy\nrequires playing it many times consecutively. This can be seen as a bandit problem with dependent\narms, with complex costs of switching between arms.\nContribution. We show that despite the fact that the true Markov model of states is unknown\nand that nothing is assumed on the wrong representations, it is still possible to derive a \ufb01nite-time\nanalysis of the regret for this problem. This is stated in Theorem 1; the bound on the regret that we\nobtain is of order T 2/3.\nThe intuition is that if the \u201ctrue\u201d model \u03c6\u2217 is known, but its probabilistic properties are not, then we\nstill know that there exists an optimal control policy that depends on the observed state sj\u2217,t only.\nTherefore, the optimal rate of rewards can be obtained by a clever exploration/exploitation strategy,\nsuch as UCRL2 algorithm [6]. 
Since we do not know in advance which model is an MDP, we need\nto explore them all, for a sufficiently long time in order to estimate the rate of rewards that one can\nget using a good policy in that model.\nOutline. In Section 2 we introduce the precise notion of model and set up the notations. Then we\npresent the proposed algorithm in Section 3; it uses UCRL2 of [6] as a subroutine and selects the\nmodels $\\phi$ according to a penalized empirical criterion. In Section 4 we discuss some directions for\nfurther development. Finally, Section 5 is devoted to the proof of Theorem 1.\n\n2 Notation and definitions\nWe consider a space of observations $\\mathcal{O}$, a space of actions $\\mathcal{A}$, and a space of rewards $\\mathcal{R}$ (all assumed\nto be Polish). Moreover, we assume that $\\mathcal{A}$ is of finite cardinality $A \\stackrel{\\mathrm{def}}{=} |\\mathcal{A}|$ and that $0 \\in \\mathcal{R} \\subset [0, 1]$.\nThe set of histories up to time $t$ for all $t \\in \\mathbb{N} \\cup \\{0\\}$ will be denoted by\n\n$$\\mathcal{H}_{<t} \\stackrel{\\mathrm{def}}{=} \\mathcal{O} \\times (\\mathcal{A} \\times \\mathcal{R} \\times \\mathcal{O})^{t-1},$$\n\nand we define the set of all possible histories by\n\n$$\\mathcal{H} \\stackrel{\\mathrm{def}}{=} \\bigcup_{t=1}^{\\infty} \\mathcal{H}_{<t}.$$\n\nEnvironments. For a Polish space $\\mathcal{X}$, we denote by $\\mathcal{P}(\\mathcal{X})$ the set of probability distributions over $\\mathcal{X}$.\nDefine an environment to be a mapping from the set of histories $\\mathcal{H}$ to the set of functions that map\nany action $a \\in \\mathcal{A}$ to a probability distribution $\\nu_a \\in \\mathcal{P}(\\mathcal{R} \\times \\mathcal{O})$ over the product space of rewards\nand observations.\nWe consider the problem of reinforcement learning when the learner interacts with some unknown\nenvironment $e^\\star$. The interaction is sequential and goes as follows: first some $h_{<1} = \\{o_0\\}$ is generated\naccording to $\\iota$, then at time step $t > 0$, the learner chooses an action $a_t \\in \\mathcal{A}$ according to the\ncurrent history $h_{<t} \\in \\mathcal{H}_{<t}$. Then a pair of reward and observation $(r_t, o_t)$ is drawn according\nto the distribution $(e^\\star(h_{<t}))_{a_t} \\in \\mathcal{P}(\\mathcal{R} \\times \\mathcal{O})$. Finally, $h_{<t+1}$ is defined by the concatenation of $h_{<t}$\nwith $(a_t, r_t, o_t)$. 
With these notations, at each time step t > 0, ot\u22121 is the last observation given\nto the learner before choosing an action, at is the action output at this step, and rt is the immediate\nreward received after playing at.\nState representation functions (models). Let S \u2282 N be some \ufb01nite set; intuitively, this has to be\nconsidered as a set of states. A state representation function \u03c6 is a function from the set of histories\nH to S. For a state representation function \u03c6, we will use the notation S\u03c6 for its set of states, and\nst,\u03c6 := \u03c6(h<t).\nIn the sequel, when we talk about a Markov decision process, it will be assumed to be weakly\ncommunicating, which means that for each pair of states u1, u2 there exists k \u2208 N and a sequence\nof actions \u03b11, .., \u03b1k \u2208 A such that P (sk+1,\u03c6 = u2|s1,\u03c6 = u1, a1 = \u03b11...ak = \u03b1k) > 0. Having\nthat in mind, we introduce the following de\ufb01nition.\n\nDe\ufb01nition 1 We say that an environment e with a state representation function \u03c6 is Markov, or, for\nshort, that \u03c6 is a Markov model (of e), if the process (st,\u03c6, at, rt), t \u2208 N is a (weakly communicating)\nMarkov decision process.\n\nFor example, consider a state-representation function \u03c6 that depends only on the last observation,\nand that partitions the observation space into \ufb01nitely many cells. Then an environment is Markov\nwith this representation function if the probability distribution on the next cells only depends on the\nlast observed cell and action. 
Note that there may be many state-representation functions with which\nan environment e is Markov.\n3 Main results\nGiven a set $\\Phi = \\{\\phi_j;\\ j \\le J\\}$ of J state-representation functions (models), one of which is\na Markov model of the unknown environment $e^\\star$, we want to construct a strategy that performs\nnearly as well as the best algorithm that knows which $\\phi_j$ is Markov, and knows all the probabilistic\ncharacteristics (transition probabilities and rewards) of the MDP corresponding to this model. For\nthat purpose we define the regret of any strategy at time $T$, as in [6, 3], as\n\n$$\\Delta(T) \\stackrel{\\mathrm{def}}{=} T\\rho^\\star - \\sum_{t=1}^{T} r_t,$$\n\nwhere $r_t$ are the rewards received when following the proposed strategy and $\\rho^\\star$ is the average optimal\nvalue in the best Markov model, i.e., $\\rho^\\star = \\lim_{T \\to \\infty} \\frac{1}{T} E\\big(\\sum_{t=1}^{T} r_t(\\pi^\\star)\\big)$, where $r_t(\\pi^\\star)$ are the rewards\nreceived when following the optimal policy for the best Markov model. Note that this definition\nmakes sense since when the MDP is weakly communicating, the average optimal value of reward\ndoes not depend on the initial state. Also, one could replace $T\\rho^\\star$ with the expected sum of rewards\nobtained in $T$ steps (following the optimal policy) at the price of an additional $O(\\sqrt{T})$ term.\n\nIn the next subsection, we describe an algorithm that achieves a sub-linear regret of order $T^{2/3}$.\n\n3.1 Best Lower Bound (BLB) algorithm\nIn this section, we introduce the Best-Lower-Bound (BLB) algorithm, described in Figure 1.\nThe algorithm works in stages of doubling length. Each stage consists of two phases: an exploration\nand an exploitation phase.\nIn the exploration phase, BLB plays the UCRL2 algorithm on each\nmodel $(\\phi_j)_{1 \\le j \\le J}$ successively, as if each model $\\phi_j$ were a Markov model, for a fixed number $\\tau_{i,1,J}$\nof rounds. 
The exploitation part consists in first selecting the model with the highest lower bound,\naccording to the empirical rewards obtained in the previous exploration phase. This model is initially\nselected for the same time as in the exploration phase, and then a test decides whether to continue\nplaying this model (if its performance during exploitation is still above the corresponding lower\nbound, i.e. if the rewards obtained are still at least as good as if it was playing the best model). If it\ndoes not pass the test, then another model (with the second-best lower bound) is selected and played, and\nso on, until the exploitation phase (of fixed length $\\tau_{i,2}$) finishes and the next stage starts.\n\nParameters: $f$, $\\delta$\nFor each stage $i \\ge 1$ do\nSet the total length of stage $i$ to be $\\tau_i := 2^i$.\n\n1. Exploration. Set $\\tau_{i,1} = \\tau_i^{2/3}$. For each $j \\in \\{1, \\dots, J\\}$ do\n\u2013 Run UCRL2 with parameter $\\delta_i(\\delta)$ defined in (1) using $\\phi_j$ during $\\tau_{i,1,J}$ steps: the state space is assumed to be $\\mathcal{S}_{\\phi_j}$ with transition structure induced by $\\phi_j$.\n\u2013 Compute the corresponding average empirical reward $\\widehat{\\mu}_{i,1}(\\phi_j)$ received during this exploration phase.\n\n2. Exploitation. Set $\\tau_{i,2} = \\tau_i - \\tau_{i,1}$ and initialize $\\mathcal{J} := \\{1, \\dots, J\\}$.\nWhile the current length of the exploitation part is less than $\\tau_{i,2}$ do\n\u2013 Select $\\widehat{j} = \\mathrm{argmax}_{j \\in \\mathcal{J}}\\ \\widehat{\\mu}_{i,1}(\\phi_j) - 2B(i, \\phi_j, \\delta)$ (using (3)).\n\u2013 Run UCRL2 with parameter $\\delta_i(\\delta)$ using $\\phi_{\\widehat{j}}$: update at each time step $t$ the current average empirical reward $\\widehat{\\mu}_{i,2,t}(\\phi_{\\widehat{j}})$ from the beginning of the run. Provided that the length of the current run is larger than $\\tau_{i,1,J}$, do the test\n$$\\widehat{\\mu}_{i,2,t}(\\phi_{\\widehat{j}}) \\ge \\widehat{\\mu}_{i,1}(\\phi_{\\widehat{j}}) - 2B(i, \\phi_{\\widehat{j}}, \\delta).$$\n\u2013 If the test fails, then stop UCRL2 and set $\\mathcal{J} := \\mathcal{J} \\setminus \\{\\widehat{j}\\}$. If $\\mathcal{J} = \\emptyset$ then set $\\mathcal{J} := \\{1, \\dots, J\\}$.\n\nFigure 1: The Best-Lower-Bound selection strategy.\n\nThe length of stage $i$ is fixed and defined to be $\\tau_i \\stackrel{\\mathrm{def}}{=} 2^i$. Thus for a total time horizon $T$, the number\nof stages $I(T)$ before time $T$ is $I(T) \\stackrel{\\mathrm{def}}{=} \\lfloor \\log_2(T + 1) \\rfloor$. Each stage $i$ (of length $\\tau_i$) is further\ndecomposed into an exploration phase (of length $\\tau_{i,1}$) and an exploitation phase (of length $\\tau_{i,2}$).\nExploration phase. All the models $\\{\\phi_j\\}_{j \\le J}$ are played one after another for the same amount of\ntime $\\tau_{i,1,J} \\stackrel{\\mathrm{def}}{=} \\tau_{i,1}/J$. Each episode $1 \\le j \\le J$ consists in running the UCRL2 algorithm using the\nmodel of states and transitions induced by the state-representation function $\\phi_j$. Note that UCRL2\ndoes not require the horizon $T$ in advance, but requires a parameter $p$ in order to ensure a near\noptimal regret bound with probability higher than $1 - p$. We define this parameter $p$ to be $\\delta_i(\\delta)$ in\nstage $i$, where\n\n$$\\delta_i(\\delta) \\stackrel{\\mathrm{def}}{=} \\big(2^i - (J^{-1} + 1) 2^{2i/3} + 4\\big)^{-1} 2^{-i+1} \\delta. \\qquad (1)$$\n\nThe average empirical reward received during each episode is written $\\widehat{\\mu}_{i,1}(\\phi_j)$.\nExploitation phase. We use the empirical rewards $\\widehat{\\mu}_{i,1}(\\phi_j)$ received in the previous exploration\npart of stage $i$ together with a confidence bound in order to select the model to play. Moreover, a\nmodel $\\phi$ is no longer run for a fixed period of time (as in the exploration part of stage $i$), but for a\nperiod $\\tau_{i,2}(\\phi)$ that depends on some test; we first initialize $\\mathcal{J} := \\{1, \\dots, J\\}$ and then choose\n\n$$\\widehat{j} \\stackrel{\\mathrm{def}}{=} \\mathrm{argmax}_{j \\in \\mathcal{J}}\\ \\widehat{\\mu}_{i,1}(\\phi_j) - 2B(i, \\phi_j, \\delta), \\qquad (2)$$\n\nwhere we define\n\n$$B(i, \\phi, \\delta) \\stackrel{\\mathrm{def}}{=} 34 f(\\tau_i - 1 + \\tau_{i,1}) |\\mathcal{S}_\\phi| \\sqrt{\\frac{A \\log(\\tau_{i,1,J}/\\delta_i(\\delta))}{\\tau_{i,1,J}}}, \\qquad (3)$$\n\nwhere $\\delta$ and the function $f$ are parameters of the BLB algorithm. Then UCRL2 is played using the\nselected model $\\phi_{\\widehat{j}}$ for the parameter $\\delta_i(\\delta)$. In parallel we test whether the average empirical reward\nwe receive during this exploitation phase is high enough; at time $t$, if the length of the current episode\nis larger than $\\tau_{i,1,J}$, we test if\n\n$$\\widehat{\\mu}_{i,2,t}(\\phi_{\\widehat{j}}) \\ge \\widehat{\\mu}_{i,1}(\\phi_{\\widehat{j}}) - 2B(i, \\phi_{\\widehat{j}}, \\delta). \\qquad (4)$$\n\nIf the test is positive, we keep playing UCRL2 using the same model. Now, if the test fails, then the\nmodel $\\widehat{j}$ is discarded (until the end of stage $i$), i.e. we update $\\mathcal{J} := \\mathcal{J} \\setminus \\{\\widehat{j}\\}$ and we select a new one\naccording to (2). We repeat those steps until the total time $\\tau_{i,2}$ of the exploitation phase of stage $i$ is\nover.\n\nRemark. Note that the model selected for exploitation in (2) is the one that has the best lower bound.\nThis is a pessimistic (or robust) selection strategy. We know that if the right model is selected, then\nwith high probability, this model will be kept during the whole exploitation phase. If this is not the\nright model, then either the policy provides good rewards and we should keep playing it, or it does\nnot, in which case it will not pass the test (4) and will be removed from the set of models that will\nbe exploited in this phase.\n3.2 Regret analysis\nTheorem 1 (Main result) Assume that a finite set of $J$ state-representation functions $\\Phi$ is given,\nand there exists at least one function $\\phi^\\star \\in \\Phi$ such that with $\\phi^\\star$ as a state-representation function the\nenvironment is a Markov decision process. If there are several such models, let $\\phi^\\star$ be the one with\nthe highest average reward of the optimal policy of the corresponding MDP. Then the regret (with\nrespect to the optimal policy corresponding to $\\phi^\\star$) of the BLB algorithm run with parameter $\\delta$, for\nany horizon $T$, with probability higher than $1 - \\delta$ is bounded as follows\n\n$$\\Delta(T) \\le c f(T) S \\big(AJ \\log\\big((J\\delta)^{-1}\\big) \\log_2(T)\\big)^{1/2} T^{2/3} + c' D S \\big(A \\log(\\delta^{-1}) \\log_2(T)\\, T\\big)^{1/2} + c(f, D), \\qquad (5)$$\n\nfor some numerical constants $c$, $c'$ and $c(f, D)$. The parameter $f(t)$ can be chosen to be any\nincreasing function; for instance, the choice $f(t) := \\log_2 t + 1$ gives $c(f, D) \\le 2^D$.\nThe proof of this result is reported in Section 5.\nRemark. Importantly, the algorithm considered here does not know in advance the diameter $D$ of\nthe true model, nor the time horizon $T$. Due to this lack of knowledge, it uses a guess $f(t)$ (e.g.\n$\\log(t)$) on this diameter, which results in the additional regret term $c(f, D)$ and the additional factor\n$f(T)$; knowing $D$ would make it possible to remove both of them, but this is a strong assumption. Choosing\n$f(t) := \\log_2 t + 1$ gives a bound which is of order $T^{2/3}$ in $T$ but is exponential in $D$; taking\n$f(t) := t^\\varepsilon$ we get a bound of order $T^{2/3+\\varepsilon}$ in $T$ but of polynomial order $1/\\varepsilon$ in $D$.\n4 Discussion and outlook\nIntuition. The main idea why this algorithm works is as follows. 
The \u201cwrong\u201d models are used\nduring exploitation stages only as long as they are giving rewards that are higher than the rewards\nthat could be obtained in the \u201ctrue\u201d model. All the models are explored sufficiently long so as\nto be able to estimate the optimal reward level in the true model, and to learn its policy. Thus,\nnothing has to be known about the \u201cwrong\u201d models. This is in stark contrast to the usual situation\nin mathematical statistics, where to be able to test a hypothesis about a model (e.g., that the data is\ngenerated by a certain model versus some alternative models), one has to make assumptions about\nalternative models. This has to be done in order to make sure that the Type II error is small (the\npower of the test is large): that this error is small has to be proven under the alternative. Here,\nalthough we are solving seemingly the same problem, the role of the Type II error is played by the\nrewards. As long as the rewards are high we do not care whether the model we are using is correct or\nnot. We only have to ensure that the true model passes the test.\nAssumptions. A crucial assumption made in this work is that the \u201ctrue\u201d model \u03c6\u2217 belongs to a\nknown finite set. While passing from a finite to a countably infinite set appears rather straightforward,\ngetting rid of the assumption that this set contains the true model seems more difficult. What\none would want to obtain in this setting is sub-linear regret with respect to the performance of the\noptimal policy in the best model; this, however, seems difficult without additional assumptions on\nthe probabilistic characteristics of the models. Another approach not discussed here would be to try\nto build a good state representation function, as suggested for instance in [5]. 
Yet another\ninteresting generalization in this direction would be to consider uncountable (possibly parametric\nbut general) sets of models. This, however, would necessarily require some heavy assumptions on\nthe set of models.\nRegret. The reader familiar with adversarial bandit literature will notice that our bound of order\n$T^{2/3}$ is worse than the $T^{1/2}$ that usually appears in this context (see, for example [2]). The reason is\nthat our notion of regret is different: in adversarial bandit literature, the regret is measured with\nrespect to the best choice of the arm for the given fixed history. In contrast, we measure the regret\nwith respect to the best policy (one that knows the correct model and its parameters) that, in general,\nwould obtain completely different (from what our algorithm would get) rewards and observations\nright from the beginning.\nEstimating the diameter? As previously mentioned, a possibly large additive constant $c(f, D)$\nappears in the regret since we do not know a bound on the diameter of the MDP in the \u201ctrue\u201d\nmodel, and use $\\log T$ instead. Finding a way to properly address this problem by estimating online\nthe diameter of the MDP is an interesting open question. Let us provide some intuition concerning\nthis problem. First, we notice that, as reported in [6], when we compute an optimistic model based\non the empirical rewards and transitions of the true model, the span of the corresponding optimistic\nvalue function $\\mathrm{sp}(\\widehat{V}^+)$ is always smaller than the diameter $D$. This span increases as we get more\nrewards and transitions samples, which gives a natural empirical lower bound on $D$. However, it\nseems quite difficult to compute a tight empirical upper bound on $D$ (or $\\mathrm{sp}(\\widehat{V}^+)$). In [3], the authors\nderive a regret bound that scales with the span of the true value function $\\mathrm{sp}(V^\\star)$, which is also less\nthan $D$, and can be significantly smaller in some cases. However, since we do not have the property\nthat $\\mathrm{sp}(\\widehat{V}^+) \\le \\mathrm{sp}(V^\\star)$, we need to introduce an explicit penalization in order to control the span of\nthe computed optimistic models, and this requires assuming we know an upper bound $B$ on $\\mathrm{sp}(V^\\star)$\nin order to guarantee a final regret bound scaling with $B$. Unfortunately this does not solve the\nestimation problem of $D$, which remains an open question.\n5 Proof of Theorem 1\nIn this section, we now detail the proof of Theorem 1. The proof is stated in several parts. First we\nrecall a general confidence bound for the UCRL2 algorithm in the true model. Then we decompose\nthe regret into the sum of the regret in each stage $i$. After analyzing the contribution to the regret in\nstage $i$, we then gather all stages and tune the length of each stage and episode in order to get the\nfinal regret bound.\n5.1 Upper and Lower confidence bounds\nFrom the analysis of UCRL2 in [6], we have the property that with probability higher than $1 - \\delta'$,\nthe regret of UCRL2 when run for $\\tau$ consecutive steps from time $t_1$ in the true model $\\phi^\\star$ is\nupper bounded by\n\n$$\\rho^\\star - \\frac{1}{\\tau} \\sum_{t=t_1}^{t_1+\\tau-1} r_t \\le 34 D |\\mathcal{S}_{\\phi^\\star}| \\sqrt{\\frac{A \\log(\\tau/\\delta')}{\\tau}}, \\qquad (6)$$\n\nwhere $D$ is the diameter of the MDP. What is interesting is that this diameter does not need to be\nknown by the algorithm. Also by carefully looking at the proof of UCRL2, it can be shown that the\nfollowing bound is also valid with probability higher than $1 - \\delta'$:\n\n$$\\frac{1}{\\tau} \\sum_{t=t_1}^{t_1+\\tau-1} r_t - \\rho^\\star \\le 34 D |\\mathcal{S}_{\\phi^\\star}| \\sqrt{\\frac{A \\log(\\tau/\\delta')}{\\tau}}.$$\n\nWe now define the following quantity, for every model $\\phi$, episode length $\\tau$ and $\\delta' \\in (0, 1)$\n\n$$B_D(\\tau, \\phi, \\delta') \\stackrel{\\mathrm{def}}{=} 34 D |\\mathcal{S}_\\phi| \\sqrt{\\frac{A \\log(\\tau/\\delta')}{\\tau}}. \\qquad (7)$$\n\n5.2 Regret of stage i\nIn this section we analyze the regret of the stage $i$, which we denote $\\Delta_i$. Note that since each stage\n$i \\le I$ is of length $\\tau_i = 2^i$ except the last one, $I$, which may stop earlier, we have\n\n$$\\Delta(T) = \\sum_{i=1}^{I(T)} \\Delta_i, \\qquad (8)$$\n\nwhere $I(T) = \\lfloor \\log_2(T + 1) \\rfloor$. We further decompose $\\Delta_i = \\Delta_{i,1} + \\Delta_{i,2}$ into the regret corresponding\nto the exploration stage $\\Delta_{i,1}$ and the regret corresponding to the exploitation stage $\\Delta_{i,2}$.\n$\\tau_{i,1}$ is the total length of the exploration stage $i$ and $\\tau_{i,2}$ is the total length of the exploitation stage $i$.\nFor each model $\\phi$, we write $\\tau_{i,1,J} \\stackrel{\\mathrm{def}}{=} \\tau_{i,1}/J$ for the number of consecutive steps during which the UCRL2\nalgorithm is run with model $\\phi$ in the exploration stage $i$, and $\\tau_{i,2}(\\phi)$ for the number of consecutive steps\nduring which the UCRL2 algorithm is run with model $\\phi$ in the exploitation stage $i$.\nGood and Bad models. Let us now introduce the two following sets of models, defined after the\nend of the exploration stage, i.e. at time $t_i$.\n\n$$\\mathcal{G}_i \\stackrel{\\mathrm{def}}{=} \\{\\phi \\in \\Phi \\,;\\, \\widehat{\\mu}_{i,1}(\\phi) - 2B(i, \\phi, \\delta) \\ge \\widehat{\\mu}_{i,1}(\\phi^\\star) - 2B(i, \\phi^\\star, \\delta)\\} \\setminus \\{\\phi^\\star\\},$$\n$$\\mathcal{B}_i \\stackrel{\\mathrm{def}}{=} \\{\\phi \\in \\Phi \\,;\\, \\widehat{\\mu}_{i,1}(\\phi) - 2B(i, \\phi, \\delta) < \\widehat{\\mu}_{i,1}(\\phi^\\star) - 2B(i, \\phi^\\star, \\delta)\\}.$$\n\nWith this definition, we have the decomposition $\\Phi = \\mathcal{G}_i \\cup \\{\\phi^\\star\\} \\cup \\mathcal{B}_i$.\n\n5.2.1 Regret in the exploration phase\nSince in the exploration stage $i$ each model $\\phi$ is run for $\\tau_{i,1,J}$ many steps, the regret for each model\n$\\phi \\ne \\phi^\\star$ is bounded by $\\tau_{i,1,J}\\rho^\\star$. 
Now the regret for the true model is \u03c4i,1,J (\u03c1(cid:63) \u2212(cid:98)\u00b51(\u03c6(cid:63))), thus the\n\ntotal contribution to the regret in the exploration stage i is upper-bounded by\n\n\u2206i,1 (cid:54) \u03c4i,1,J (\u03c1(cid:63) \u2212(cid:98)\u00b51(\u03c6(cid:63))) + (J \u2212 1)\u03c4i,1,J \u03c1(cid:63) .\n\n(9)\n\n5.2.2 Regret in the exploitation phase\nBy de\ufb01nition, all models in Gi \u222a {\u03c6(cid:63)} are selected before any model in Bi is selected.\nThe good models. Let us consider some \u03c6 \u2208 Gi and an event \u2126i under which the exploitation\nphase does not reset. The test (equation (4)) starts after \u03c4i,1,J, thus, since there is not reset, either\n\u03c4i,2(\u03c6) = \u03c4i,1,J in which case the contribution to the regret is bounded by \u03c4i,1,J \u03c1(cid:63) , or \u03c4i,2(\u03c6) >\n\u03c4i,1,J, in which case the regret during the (\u03c4i,2(\u03c6) \u2212 1) steps (where the test was successful) is\nbounded by\n\n(\u03c4i,2(\u03c6) \u2212 1)(\u03c1(cid:63) \u2212(cid:98)\u00b5i,2,\u03c4i,2(\u03c6)\u22121(\u03c6)) (cid:54) (\u03c4i,2(\u03c6) \u2212 1)(\u03c1(cid:63) \u2212(cid:98)\u00b5i,1(\u03c6) + 2B(i, \u03c6, \u03b4))\n(cid:54) (\u03c4i,2(\u03c6) \u2212 1)(\u03c1(cid:63) \u2212(cid:98)\u00b5i,1(\u03c6(cid:63)) + 2B(i, \u03c6(cid:63), \u03b4)) ,\n\nand now since in the last step \u03c6 fails to pass the test, this adds a contribution to the regret at most \u03c1(cid:63).\nWe deduce that the total contribution to the regret of all the models \u03c6 \u2208 Gi in the exploitation stages\non the event \u2126i is bounded by\n\nmax{\u03c4i,1,J \u03c1(cid:63), (\u03c4i,2(\u03c6) \u2212 1)(\u03c1(cid:63) \u2212(cid:98)\u00b5i,1(\u03c6(cid:63)) + 2B(i, \u03c6(cid:63), \u03b4)) + \u03c1(cid:63)} .\n\n\u2206i,2(Gi) (cid:54)(cid:88)\n\nstages in episode i on \u2126i is bounded by\n\nThe true model. 
First, let us note that since the total regret of the true model during the exploitation\n\nstep i is given by \u03c4i,2(\u03c6(cid:63))(\u03c1(cid:63) \u2212(cid:98)\u00b5i,2,t(\u03c6(cid:63))) , then the total regret of the exploration and exploitation\n\u2206i (cid:54) \u03c4i,1,J (\u03c1(cid:63) \u2212(cid:98)\u00b51(\u03c6(cid:63))) + \u03c4i,1,J (J \u2212 1)\u03c1(cid:63) + \u03c4i,2(\u03c6(cid:63))(\u03c1(cid:63) \u2212(cid:98)\u00b5i,2,ti+\u03c4i,2 (\u03c6(cid:63))) +\n(cid:88)\nmax{\u03c4i,1,J \u03c1(cid:63), (\u03c4i,2(\u03c6) \u2212 1)(\u03c1(cid:63) \u2212(cid:98)\u00b5i,1(\u03c6(cid:63)) + 2B(i, \u03c6(cid:63), \u03b4)) + \u03c1(cid:63)} +\n\n(cid:88)\n\n(10)\n\n\u03c6\u2208G\n\n\u03c4i,2(\u03c6)\u03c1(cid:63) .\n\n\u03c6\u2208Bi\n\n\u03c6\u2208Gi\n\nNow from the analysis provided in [6] we know that when we run the UCRL2 with the true model\n\u03c6(cid:63) with parameter \u03b4i(\u03b4), then there exists an event \u21261,i of probability at least 1 \u2212 \u03b4i(\u03b4) such that on\nthis event\nand similarly there exists an event \u21262,i of probability at least 1 \u2212 \u03b4i(\u03b4), such that on this event\n\n\u03c1(cid:63) \u2212(cid:98)\u00b5i,1(\u03c6(cid:63)) (cid:54) BD(\u03c4i,1,J , \u03c6(cid:63), \u03b4i(\u03b4)) ,\n\u03c1(cid:63) \u2212(cid:98)\u00b5i,2,t(\u03c6(cid:63)) (cid:54) BD(\u03c4i,2(\u03c6(cid:63)), \u03c6(cid:63), \u03b41(\u03b4)) .\n\nNow we show that, with high probability, the true model \u03c6(cid:63) passes all the tests (equation (4)) until\nthe end of the episode i, and thus equivalently, with high probability no model \u03c6 \u2208 Bi is selected,\n\nso that (cid:88)\ncorresponding to (cid:98)\u00b5i,1(\u03c6(cid:63)) is shared by all the tests. Thus we deduce that with probability higher\n\nFor the true model, after \u03c4 (\u03c6(cid:63), t) (cid:62) \u03c4i,1,J, there remains at most (\u03c4i,2\u2212\u03c4i,1,J +1) possible timesteps\nwhere we do the test for the true model \u03c6(cid:63). 
For each test we need to control \u00b5i,2,t(\u03c6(cid:63)), and the event\nthan 1\u2212 (\u03c4i,2 \u2212 \u03c4i,1,J + 2)\u03b4i(\u03b4) we have simultaneously on all time step until the end of exploitation\nphase of stage i,\n\n(cid:98)\u00b5i,2,t(\u03c6(cid:63)) \u2212(cid:98)\u00b5i,1(\u03c6(cid:63)) = (cid:98)\u00b5i,2,t(\u03c6(cid:63)) \u2212 \u03c1(cid:63) + \u03c1(cid:63) \u2212(cid:98)\u00b5i,1(\u03c6(cid:63))\n\n\u03c4i,2(\u03c6) = 0.\n\n\u03c6\u2208Bi\n\n(cid:62) \u2212BD(\u03c4 (\u03c6(cid:63), t), \u03c6(cid:63), \u03b4i(\u03b4)) \u2212 BD(\u03c4i,1,J , \u03c6(cid:63), \u03b4i(\u03b4))\n(cid:62) \u22122BD(\u03c4i,1,J , \u03c6(cid:63), \u03b4i(\u03b4)) .\n\nNow provided that f (ti) (cid:62) D, then BD(\u03c4i,1,J , \u03c6(cid:63), \u03b4i(\u03b4)) (cid:54) B(i, \u03c6(cid:63), \u03b4) , thus the true model passes\nall tests until the end of the exploitation part of stage i on an event \u21263,i of probability higher than\n1 \u2212 (\u03c4i,2 \u2212 \u03c4i,1,J + 2)\u03b4i(\u03b4). Since there is no reset, we can choose \u2126i\ndef= \u21263,i. 
Note that on this event, we thus have\n\n$$\\sum_{\\phi \\in B_i} \\tau_{i,2}(\\phi) = 0 .$$\n\nBy a union bound over the events $\\Omega_{1,i}$, $\\Omega_{2,i}$ and $\\Omega_{3,i}$, we deduce that with probability higher than $1 - (\\tau_{i,2} - \\tau_{i,1,J} + 4)\\delta_i(\\delta)$,\n\n$$\\Delta_i \\le \\tau_{i,1,J} B_D(\\tau_{i,1,J}, \\phi^\\star, \\delta_i(\\delta)) + [\\tau_{i,1,J}(J-1) + |G_i|]\\rho^\\star + \\tau_{i,2}(\\phi^\\star) B_D(\\tau_{i,2}(\\phi^\\star), \\phi^\\star, \\delta_i(\\delta)) + \\sum_{\\phi \\in G_i} \\max\\{(\\tau_{i,1,J} - 1)\\rho^\\star, (\\tau_{i,2}(\\phi) - 1)(B_D(\\tau_{i,1,J}, \\phi^\\star, \\delta_i(\\delta)) + 2B(i, \\phi^\\star, \\delta))\\} .$$\n\nNow using again the fact that $f(t_i) \\ge D$, and after some simplifications, we deduce that\n\n$$\\Delta_i \\le \\tau_{i,1,J} B_D(\\tau_{i,1,J}, \\phi^\\star, \\delta_i(\\delta)) + \\tau_{i,2}(\\phi^\\star) B_D(\\tau_{i,2}(\\phi^\\star), \\phi^\\star, \\delta_i(\\delta)) + \\sum_{\\phi \\in G_i} (\\tau_{i,2}(\\phi) - 1) 3B(i, \\phi^\\star, \\delta) + \\tau_{i,1,J}(J + |G_i| - 1)\\rho^\\star .$$\n\nFinally, we use the fact that $\\tau B_D(\\tau, \\phi^\\star, \\delta_i(\\delta))$ is increasing in $\\tau$ to deduce the following rough bound, which holds with probability higher than $1 - (\\tau_{i,2} - \\tau_{i,1,J} + 4)\\delta_i(\\delta)$:\n\n$$\\Delta_i \\le \\tau_{i,2} B(i, \\phi^\\star, \\delta) + \\tau_{i,2} B_D(\\tau_{i,2}, \\phi^\\star, \\delta_i(\\delta)) + 2J\\tau_{i,1,J}\\rho^\\star ,$$\n\nwhere we used the fact that $\\tau_{i,2} = \\tau_{i,2}(\\phi^\\star) + \\sum_{\\phi \\in G} \\tau_{i,2}(\\phi)$.\n\n5.3 Tuning the parameters of each stage.\nWe now conclude by tuning the parameters of each stage, i.e. the probabilities $\\delta_i(\\delta)$ and the lengths $\\tau_i$, $\\tau_{i,1}$ and $\\tau_{i,2}$. 
The total length of stage $i$ is by definition\n\n$$\\tau_i = \\tau_{i,1} + \\tau_{i,2} = \\tau_{i,1,J} J + \\tau_{i,2} ,$$\n\nwhere $\\tau_i = 2^i$. So we set $\\tau_{i,1} \\stackrel{\\mathrm{def}}{=} \\tau_i^{2/3}$, and then we have $\\tau_{i,2} \\stackrel{\\mathrm{def}}{=} \\tau_i - \\tau_i^{2/3}$ and $\\tau_{i,1,J} = \\tau_i^{2/3}/J$. Now using these values and the definitions of the bounds $B(i, \\phi^\\star, \\delta)$ and $B_D(\\tau_{i,2}, \\phi^\\star, \\delta_i(\\delta))$, we deduce that, with probability higher than $1 - (\\tau_{i,2} - \\tau_{i,1,J} + 4)\\delta_i(\\delta)$, the following upper bound holds:\n\n$$\\Delta_i \\le 34 f(t_i) S \\sqrt{AJ \\log\\Big(\\frac{\\tau_i^{2/3}}{J\\delta_i(\\delta)}\\Big)} \\tau_i^{2/3} + 34 D S \\sqrt{A \\log\\Big(\\frac{\\tau_i}{\\delta_i(\\delta)}\\Big) \\tau_i} + 2\\tau_i^{2/3} \\rho^\\star ,$$\n\nwith $t_i = 2^i - 1 + 2^{2i/3}$, and where we used the fact that $\\big(J/\\tau_i^{2/3}\\big)^{1/2} \\tau_{i,2} \\le \\sqrt{J} \\tau_i^{2/3}$.\n\nWe now define $\\delta_i(\\delta)$ such that $\\delta_i(\\delta) \\stackrel{\\mathrm{def}}{=} (2^i - (J^{-1} + 1) 2^{2i/3} + 4)^{-1} 2^{-i+1} \\delta$.\n\nSince for the stages $i \\in I_0 \\stackrel{\\mathrm{def}}{=} \\{i \\ge 1 ; f(t_i) < D\\}$ the regret is bounded by $\\Delta_i \\le \\tau_i \\rho^\\star$, the total cumulative regret of the algorithm is bounded with probability higher than $1 - \\delta$ (using the definition of the $\\delta_i(\\delta)$) by\n\n$$\\Delta(T) \\le \\sum_{i \\notin I_0} \\Big( \\Big[ 34 f(t_i) S \\sqrt{JA \\log\\Big(\\frac{2^{8i/3}}{J\\delta}\\Big)} + 2 \\Big] 2^{2i/3} + 34 D S \\sqrt{A \\log\\Big(\\frac{2^{3i}}{\\delta}\\Big) 2^i} \\Big) + \\sum_{i \\in I_0} 2^i \\rho^\\star ,$$\n\nwhere $t_i = 2^i - 1 + 2^{2i/3} \\le T$.\n\nWe conclude by using the fact that $I(T) \\le \\log_2(T+1)$: with probability higher than $1 - \\delta$, the following bound on the regret holds\n\n$$\\Delta(T) \\le c f(T) S \\big(AJ \\log((J\\delta)^{-1}) \\log_2(T)\\big)^{1/2} T^{2/3} + c' D S \\big(A \\log(\\delta^{-1}) \\log_2(T) T\\big)^{1/2} + c(f, D) ,$$\n\nfor some constants $c$, $c'$, and where $c(f, D) = \\sum_{i \\in I_0} 2^i \\rho^\\star$. Now for the special choice $f(T) \\stackrel{\\mathrm{def}}{=} \\log_2(T+1)$, $i \\in I_0$ means $2^i + 2^{2i/3} < 2^D + 2$, thus we must have $i < D$, and thus $c(f, D) \\le 2^D$.\n\nAcknowledgements\nThis research was partially supported by the French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and FEDER through CPER 2007-2013, ANR projects EXPLO-RA (ANR-08-COSI-004) and Lampada (ANR-09-EMER-007), by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (project CompLACS), and by Pascal-2.\n\nReferences\n[1] Peter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235\u2013256, 2002.\n\n[2] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322\u2013331, Oct 1995.\n\n[3] Peter L. Bartlett and Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI, pages 35\u201342, Arlington, Virginia, United States, 2009. AUAI Press.\n\n[4] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213\u2013231, March 2003.\n\n[5] Marcus Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3\u201324, 2009.\n\n[6] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. 
Journal of Machine Learning Research, 11:1563\u20131600, August 2010.\n\n[7] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209\u2013232, November 2002.\n\n[8] Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4\u201322, 1985.\n\n[9] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527\u2013535, 1952.\n\n[10] Daniil Ryabko and Marcus Hutter. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405:274\u2013284, October 2008.\n\n[11] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, ICML, pages 881\u2013888, New York, NY, USA, 2006. ACM.\n\n[12] Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Proceedings of Neural Information Processing Systems Conference (NIPS), 2007.", "award": [], "sourceid": 1427, "authors": [{"given_name": "Odalric-ambrym", "family_name": "Maillard", "institution": null}, {"given_name": "Daniil", "family_name": "Ryabko", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}