{"title": "Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 726, "page_last": 736, "abstract": "We consider an agent who is involved in an online Markov decision process, and receives a vector of outcomes every round. The agent aims to simultaneously  optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes for achieving  near-optimality. In particular, contrary to the single objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm based on the  Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), as well as a crucial and novel gradient threshold procedure. The procedure involves carefully delaying gradient updates, and returns a non-stationary policy that diversifies the outcomes for optimizing the objectives.", "full_text": "Regret Minimization for Reinforcement Learning\nwith Vectorial Feedback and Complex Objectives\n\nDepartment of Industrial Systems Engineering and Management\n\nWang Chi Cheung\n\nNational University of Singapore\n\nisecwc@nus.edu.sg\n\nAbstract\n\nWe consider an agent who is involved in an online Markov Decision Process, and\nreceives a vector of outcomes every round. The agent optimizes an aggregate\nreward function on the multi-dimensional outcomes. Due to state transitions, it\nis challenging to balance the contribution from each dimension for achieving\nnear-optimality. Contrary to the single objective case, stationary policies are\ngenerally sub-optimal. We propose a no-regret algorithm based on the Frank-Wolfe\nalgorithm (Frank and Wolfe 1956, Agrawal and Devanur 2014) , UCRL2 (Jaksch\net al. 
2010), as well as a crucial and novel Gradient Threshold Procedure (GTP).\nGTP involves carefully delaying gradient updates, and returns a non-stationary\npolicy that diversifies the outcomes for optimizing the aggregate reward.\n\n1 Introduction\n\nMarkov Decision Processes (MDPs) model sequential optimization problems with changes in the\nstate of the underlying environment. At each time, an agent performs an action, contingent upon the\ncurrent state. Influenced by the present state and action, the agent transits to another state and receives\nsome form of feedback. Typically, the feedback is a scalar reward, and the agent aims to maximize\nthe total reward. Nevertheless, in many settings, the feedback is a vector of multiple outcomes, and\nthe agent\u2019s goal depends on each of these outcomes. Moreover, the underlying MDP model is usually\nnot known to the agent, and is to be learned on-the-fly. Motivated by these situations, we consider the\nComplex-Objective Online MDP (CO-OMDP) problem, in which the agent maximizes an aggregate\nfunction on the average vectorial outcome.\nSolving the CO-OMDP problem requires overcoming the following subtle challenges. To maximize\nthe aggregate function, an agent has to balance the contributions from the outcomes\u2019 different\ncomponents by alternating among different actions, which are generally associated with different\nstates. Consequently, the agent has to traverse the state space, which could require visiting\nsub-optimal states that do not contribute to the maximization of the aggregate function. Altogether, the\nmaximization can be hindered by undesirable state transitions, a difficulty worsened by the agent\u2019s\nmodel uncertainty.\nWe overcome these challenges by proposing TFW-UCRL2, a near-optimal online algorithm\nfor the CO-OMDP problem. The algorithm is built upon the Frank-Wolfe algorithm (FW) [21, 2],\nUCRL2 [28], as well as our novel Gradient Threshold Procedure (GTP). 
FW balances the objectives\nby scalarizing the outcomes, and UCRL2 solves scalarized online MDP problems under model\nuncertainty. However, FW and UCRL2 are not enough for overcoming the challenges in balancing the\noutcomes while avoiding sub-optimal states. GTP overcomes the challenges by judiciously delaying\nthe gradient updates in FW. The procedure approximately maintains the balancing effect of FW, while\nlimiting the visits to sub-optimal states by switching among different stationary policies adaptively and\ninfrequently, despite the model uncertainty.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRelated Literature. The CO-OMDP problem is closely related to the Bandits with global concave\nRewards problem (BwR) [2] and the Scalar-Objective Online MDPs problem (SO-OMDP) [28].\nBwR concerns maximizing aggregate functions on vectorial feedback in stochastic bandit settings.\nBwR is first studied by [2], who solve BwR by a synergy of online convex optimization and upper\nconfidence bound (UCB) algorithms. Subsequently, BwR is studied under various stochastic bandit\nsettings and reward functions [4, 17, 14]. BwR is closely related to the Bandits with Knapsacks problem\n(BwK), which models stochastic bandit problems with resource constraints. BwK is first studied\nby [10], and is subsequently studied in various settings [2, 11, 3, 20]. The models for BwR and\nBwK assume i.i.d. outcomes across time, while in an MDP setting the outcome distribution changes\nendogenously according to the state transition.\nThe adversarial BwK problem is recently studied in [27]. Among other results, they show that no\nonline algorithm can achieve an expected value of \u2126(1/ log T ) times the offline optimum, where T is\nthe number of time steps. Our positive results on CO-OMDPs walk a fine line between the negative\nresults for adversarial BwK and the positive results for stochastic BwR and BwK. 
Online optimization\nproblems with global reward functions are studied in adversarial settings with full feedback [19, 9].\nThe SO-OMDP problem is first studied by [8, 28] for communicating MDPs. Subsequently, the\nproblem is studied under the more general cases of weakly communicating MDPs [13, 23] and\nnon-communicating MDPs [22]. Posterior sampling algorithms for the SO-OMDP problem are\nproposed and analyzed [5, 37]. The SO-OMDP problem is also studied under certain mixing time\nassumptions on all stationary policies [36]. The SO-OMDP problem assumes scalar rewards, and does\nnot incorporate multi-objective optimization. For a review on MDPs, please consult [38, 15].\nReinforcement Learning (RL) with vectorial feedback and aggregate functions is studied in the\ndiscounted-reward setting [24, 6, 12, 43, 1, 25, 42, 29, 30, 33] and the average-reward setting\n[6, 32, 41, 40]. We study the latter with an online model under model uncertainty. Our work shows\nnon-asymptotic convergence to the optimum, which differs from [6, 32], who show asymptotic convergence.\nTarbouriech and Lazaric [41] study an online model for state space exploration, and achieve a\nnon-asymptotic convergence to the optimum. In addition to the choice of the aggregate functions, our work\ndiffers from [41] in two aspects. First, the transition kernel is assumed to be known in [41], whereas\nthe kernel is not known in our model. Second, the reachability assumption of unichain MDPs is\nmade in [41], while we make the much weaker assumption of communicating MDPs. More recently,\nonline MDPs with adversarially chosen aggregate functions are studied by [40]. The model in [40] is\nepisodic, where the state is reset to a fixed state at the end of an episode (involving a fixed number\nof steps). In contrast, our setting does not involve any state reset. 
In [40] the aggregate function is\napplied only on the trajectory in each episode, whereas in our setting the aggregate function is applied\non the trajectory across the whole horizon. Finally, we point out that a substantial generalization of\nthe current paper has been put forth in [18].\nIn the discounted reward setting, multi-objective optimization is studied in [24, 6, 12, 43]. Many\nrecent works study the discounted-reward setting with resource constraints [1, 42, 29, 33]. Numerous\nrecent research works focus on state space exploration problems [25, 30] in the discounted-reward\nsetting. Constrained MDPs are reviewed in [6], and multi-objective RL is surveyed in [39, 31].\n\n2 Problem Definition of CO-OMDP\nA CO-OMDP instance is specified as (S, s1,A, p,V, g). The set S is a finite state space, and s1 \u2208 S\nis the starting state. The collection A = {As}s\u2208S contains a finite set of actions As for each state s.\nWe say (s, a) is a state-action pair iff s \u2208 S, a \u2208 As. The collection p = {p(\u00b7|s, a)}s\u2208S,a\u2208As is the\ntransition kernel, and the collection V = {V(s, a)}s\u2208S,a\u2208As governs the vectorial outcomes. When\nthe agent chooses action a \u2208 As at state s, her subsequent state s' is distributed as p(\u00b7|s, a) \u2208 \u2206S.\nShe receives a stochastic vectorial outcome V (s, a) \u2208 [0, 1]K, distributed as V(s, a), and has\nmean E[V (s, a)] = v(s, a) = (vk(s, a))_{k=1}^{K}. We emphasize that s', V1(s, a), . . . , VK(s, a) can be\narbitrarily correlated. We focus on the following reward function g : [0, 1]K \u2192 R\u22650, which is\nparameterized by L0 \u2208 R\u22650, L1, . . . 
, LK \u2208 R, and a convex compact set U \u2286 [0, 1]K:\n\ng(w) := (1/K) \u00b7 [ \u2211_{k=1}^{K} Lk wk \u2212 (L0/2) min_{u\u2208U} { \u2211_{k=1}^{K} (wk \u2212 uk)^2 } ].    (1)\n\nThe function g is concave (see Appendix B.1), and is to be maximized.\n\nDynamics. An agent, who faces a CO-OMDP instance M = (S, s1,A, p,V, g), starts at state\ns1 \u2208 S. At time t, three events happen. First, the agent observes her current state st. Second,\nshe takes an action at \u2208 Ast. Third, she transits to another state st+1 \u223c p(\u00b7|st, at), and receives\nthe vectorial outcome Vt(st, at) \u223c V(st, at). Both st+1 and Vt(st, at) are observed by the agent.\nThe whole dynamics result in a controlled Markov process {st, at, Vt(st, at)}_{t=1}^{\u221e}. Conditioned on\n(st, at), the random variable pair (st+1, Vt(st, at)) is independent of Ht\u22121.\nIn the second event, the choice of at is based on a non-anticipatory policy. The choice only depends\non the current state st and the previous observations Ht\u22121 := {sq, aq, Vq(sq, aq)}_{q=1}^{t\u22121}. When at only\ndepends on st, but not on Ht\u22121, the corresponding non-anticipatory policy is said to be stationary.\nObjective. The CO-OMDP instance M is latent. While the agent knows S, s1,A, g, she does not\nknow v, p. To state the objective, define \u00afV1:t := (1/t) \u2211_{q=1}^{t} Vq(sq, aq). For any horizon T not known a\npriori, the agent aims to maximize g( \u00afV1:T ), by selecting actions a1, . . . , aT with a non-anticipatory\npolicy. Denote \u00afV1:T,k as the k-component of the time average vector \u00afV1:T . CO-OMDPs capture the\nfollowing problems:\n\nMulti-Objective Optimization. Consider maximizing the scalar function \u2211_{k=1}^{K} Lk \u00afV1:T,k, while\ntrying to meet the Key Performance Index (KPI) requirement \u00afV1:T,k \u2265 \u03c1k for each k \u2208 {1, . . . , K}.\nThe vector \u03c1 = (\u03c1k)_{k=1}^{K} \u2208 [0, 1]K comprises the pre-determined KPI targets for the K objectives\n{ \u00afV1:T,k}_{k=1}^{K}. The task can be modelled as a CO-OMDP problem, by setting L0 \u2265 0, and U = {w :\nwk \u2265 \u03c1k \u22001 \u2264 k \u2264 K}. By putting \u03c1k = 1 and Lk = 0 for each k, any maximizer of g( \u00afV1:T ) is\nPareto-optimal for the simultaneous maximization of \u00afV1:T,1, . . . , \u00afV1:T,K. The Pareto optimality still\nholds when we replace the inequality wk \u2265 1 with wk \u2265 \u03c1UB_k, for any \u03c1UB_k that bounds the average\n\u00afV1:T,k for any policy from above.\nState Space Exploration. Consider visiting each state s with empirical frequency as close as possible\nto a target frequency \u0001s in T time steps, where \u0001 = {\u0001s}s\u2208S \u2208 \u2206S. The task can be phrased as a\nCO-OMDP problem. For each state-action pair (s, a), we define V (s, a) \u2208 {0, 1}S as the standard\nbasis vector for s in RS, with value 1 at the s-coordinate and value 0 at the others. In addition, set\nL0 = 1, L1 = . . . = LK = 0, U = {\u0001}. Maximizing g( \u00afV1:T ) is equivalent to minimizing the mean\nsquared error \u2211_{s\u2208S} (\u0001s \u2212 \u2211_{t=1}^{T} 1{st = s}/T )^2. To generalize, we can consider visiting certain subsets\n(not necessarily disjoint or covering) of S with some target frequencies.\nFinally, when we specialize the CO-OMDP problem with L0 = 0, we recover the SO-OMDP problem\n[28]. If we specialize with S = {s}, we recover the BwR problem [2] with reward function g.\nReachability of M. To learn the latent model, the agent has to travel among states. For any s, s' \u2208 S\nand any stationary policy \u03c0, we define the travel time from s to s' under \u03c0 as the random variable\n\u039b(s'|\u03c0, s) := min{t : st+1 = s', s1 = s, s\u03c4 +1 \u223c p(\u00b7|s\u03c4 , \u03c0(s\u03c4 )) \u2200\u03c4}. We assume the following:\nAssumption 2.1. The latent CO-OMDP instance M is communicating, that is, the quantity D :=\nmax_{s,s'\u2208S} min_{stationary \u03c0} E[\u039b(s'|\u03c0, s)] is finite. We call D the diameter of M.\nThe same reachability assumption is made in [28]. Since the instance M is latent, the corresponding\ndiameter D is also not known to the agent. Assumption 2.1 is weaker than the unichain assumption\n[6, 32, 41], where every stationary policy induces a single recurrent class on S.\nOffline Benchmark and Regret. To measure the effectiveness of a policy, we rephrase the agent\u2019s\nobjective as the minimization of regret: Reg(T ) := opt(PM) \u2212 g( \u00afV1:T ). The offline benchmark\nopt(PM) is the optimum of the convex optimization problem (PM), which serves as a fluid relaxation\n[38, 6] to the CO-OMDP problem.\n\n(PM): max_x g( \u2211_{s\u2208S,a\u2208As} v(s, a)x(s, a) )\n      s.t. \u2211_{a\u2208As} x(s, a) = \u2211_{s'\u2208S,a'\u2208As'} p(s|s', a')x(s', a')  \u2200s \u2208 S    (2a)\n           \u2211_{s\u2208S,a\u2208As} x(s, a) = 1    (2b)\n           x(s, a) \u2265 0  \u2200s \u2208 S, a \u2208 As    (2c)\n\nIn (PM), the variables {x(s, a)}s,a form a probability distribution over the state-action pairs. The\nset of constraints (2a) requires the rates of transiting into and out of each state s to be equal.\nWe aim to design a non-anticipatory policy with an anytime regret bound Reg(T ) = O(1/T^\u03b1) for\nsome \u03b1 > 0. That is, for all \u03b4 > 0, there exist constants c, C (which only depend on K, S, A, g, \u03b4),\nso that the policy satisfies Reg(T ) \u2264 c/T^\u03b1 for all T \u2265 C with probability at least 1 \u2212 \u03b4. 
Achieving\nReg(T ) = O(1/T \u03b1) for some \u03b1 > 0 implies achieving near-optimality, since opt(PM) differs from\nthe expected optimum only by an additive error of O( \u00afLD/T ), by a similar reasoning to [28] (see\n[18] for details).\n\n3 Challenges of CO-OMDP, and Algorithm TFW-UCRL2\n\nWe \ufb01rst discuss some unique challenges in the CO-OMDP, then present and discuss TFW-UCRL2 in\nAlgorithm 1. Finally, we present the regret bound for TFW-UCRL2.\nChallenges. We begin by describing some unique challenges in CO-OMDP hinted in the Introduction.\nConsider the three instances in Fig 1. An arc from state s to s(cid:48) represents action a with p(s(cid:48)|s, a) = 1,\nand is labelled with its outcome V (s, a), which is deterministic. Let\u2019s focus on Figs 1a, 1b. The\ncommon objective requires balancing the 2-dimensional outcomes by visiting the left loop (ll) and\nthe right loop (rl) with frequency 0.5 each. In Fig 1a, the agent incurs a O(1/T ) regret by choosing\nll once, then rl once, then ll once, and so on.\n\n(cid:0)1\n(cid:1)\n\n(cid:0)0\n(cid:1)\nFigure 1: Instances, with opt. actions bolded. Insts (1a, 1b) have g(w) = \u2212(cid:80)2\n\n(b) CO-OMDP\n\n(cid:0)1\n(cid:1)\n\n(a) BwR\n\n(cid:0)0\n\ns0\n\n(cid:1)\n\ns0\n\n1\n\n1\n\n0\n\ns1\n\nk=1(wk \u2212 0.5)2/2.\n\n(c) SO-OMDP\n\n(cid:0)0\n(cid:0)0\n\n(cid:1)\n(cid:1)\n\n(cid:0)0\n(cid:0)0\n\n0\n\n(cid:1)\n(cid:1)\n\n0\n\n0\n\n0\n\n0\n\ns1\n\n0\n\ns2\n\n1\n\ns2\n\ns0\n\n1\n\n0\n\n1\n\n0\n\nHowever, if the agent visits ll once, then rl once, then ll once, and so on in Fig 1b, she suffers\nReg(T ) = \u2126(1). Indeed, she spends two third of the time at the actions with the \u2018sub-optimal\u2019 state\ns0, resulting in \u00afV1:T \u2248 (1/6, 1/6)(cid:62) for large T . 
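The gap between strict alternation and longer stays can be checked numerically. The following is a minimal sketch under stated assumptions, not part of TFW-UCRL2: the helper names average_outcome and regret and the block length stay are hypothetical, every switch between loops is assumed to cost exactly two zero-outcome steps through s0, one loop is assumed to yield outcome (1, 0) and the other (0, 1), and g is taken from the caption of Fig 1 with opt(PM) = 0.

```python
def g(w):
    # Aggregate reward from the caption of Fig 1: g(w) = -sum_k (w_k - 0.5)^2 / 2.
    return -sum((x - 0.5) ** 2 for x in w) / 2.0


def average_outcome(stay, horizon):
    # Alternate between the two loops. Each visit costs 2 travel steps with
    # outcome (0, 0) (an assumed simplification), followed by `stay` loop steps
    # that each add 1 to the component served by the current loop.
    totals = [0.0, 0.0]
    loop = 0
    t = 0
    while t < horizon:
        t += 2  # travel through s0 into the next loop
        steps = min(stay, max(horizon - t, 0))
        totals[loop] += steps
        t += steps
        loop = 1 - loop
    return [x / horizon for x in totals]


def regret(stay, horizon):
    # The offline optimum is 0 on this instance, so the regret is -g(average).
    return 0.0 - g(average_outcome(stay, horizon))


if __name__ == '__main__':
    print(regret(1, 6000))       # strict alternation
    print(regret(100, 10200))    # stays of about sqrt(T) steps
```

Under these assumptions, strict alternation (stay = 1) keeps the average outcome at (1/6, 1/6) and the regret at 1/9 whenever the horizon is a multiple of 6, while stays of roughly the square root of the horizon bring the regret close to 0.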
While the agent should visit each loop multiple times\nbefore going to state s0 and then another loop, the length of stay at each loop is not a priori clear.\nOur Gradient Threshold Procedure (GTP) provides a principled way for determining these lengths,\nand GTP generalizes to other communicating MDPs. Finally, such a subtlety in state transitions does\nnot occur in Fig 1c or generally in communicating SO-OMDP instances, where the agent achieves\nnear-optimality by remaining in a single recurrent class.\nTFW-UCRL2 runs in episodes. Episode m starts at the beginning of time \u03c4 (m) and ends at the end\nof time \u03c4 (m + 1) \u2212 1. During episode m, the agent follows a certain stationary policy \u02dc\u03c0m. The start\ntimes {\u03c4 (m)}\u221e\nm=1 are decided adaptively. We maintain con\ufb01dence regions\nm = {H v\nH v\n\nm(s, a)}s,a on the latent v, p across episodes, by \ufb01rst de\ufb01ning\n\nm=1 and policies {\u02dc\u03c0m}\u221e\n\nm(s, a)}s,a, H p\n\nm = {H p\n\u03c4 (m)\u22121(cid:88)\n\nt=1\n\n\u03c4 (m)\u22121(cid:88)\n\n\u03c4 (m)\u22121(cid:88)\n\n4\n\nNm(s, a) =\n\n1(st,at)=(s,a), N +\n\nThe estimates and con\ufb01dence regions for v are:\n\nm(s, a) = max{1, Nm(s, a)}.\n(cid:32)(cid:115)\n\n1\n\n\u02c6vm(s, a) :=\n\nm(s, a) :=(cid:8)\u00afv \u2208 [0, 1]K : |\u00afvk \u2212 \u02c6vm,k(s, a)| \u2264 radv\n\nVt(st, at)1(st,at)=(s,a),\n\nm,k(s, a) = \u02dcO\n\nm,k(s, a) \u2200k \u2208 [K](cid:9) .\n\nm(s, a)\n\nradv\n\nN +\n\nt=1\n\nH v\n\n\u02c6vm,k(s, a)\nN +\nm(s, a)\n\nThe estimates and con\ufb01dence regions for p are:\n\n1\n\n\u02c6pm(s(cid:48)|s, a) :=\n\nm(s, a) :=(cid:8)\u00afp \u2208 \u2206S : |\u00afp(s(cid:48)) \u2212 \u02c6pm(s(cid:48)|s, a)| \u2264 radp\n\n1(st,at,st+1)=(s,a,s(cid:48)), radp\n\nm(s, a)\n\nN +\n\nt=1\n\nH p\n\n(cid:32)(cid:115)\nm(s(cid:48)|s, a) \u2200s(cid:48) \u2208 S(cid:9) .\n\nm(s(cid:48)|s, a) = \u02dcO\n\n(cid:33)\n\n\u02c6pm(s(cid:48)|s, a)\nN +\nm(s, 
a)\n\n(cid:33)\n\n(3)\n\n,\n\n(4)\n\n,\n\n(5)\n\n\f1\nK\n\nm,k(s, a), radp\n\nm(s(cid:48)|s, a) in Appendix B.2. We now explain\nWe provide the complete expressions of radv\nthe three vital components of TFW-UCRL2: (i) Frank-Wolfe (FW) [21], which has been adapted in\nrelated research on BwR [2, 14] and exploration problems in MDPs [25, 41], (ii) Extended Value\nIteration (EVI) [28], (iii) our crucial and novel Gradient Threshold Procedure (GTP).\nFrank Wolfe (FW) [21] provides a way to balance the vectorial outcome at each time step t. We\ndenote (cid:107) \u00b7 (cid:107)2 as the Euclidean norm, and de\ufb01ne \u03a0U (w) = argminu\u2208U(cid:107)u \u2212 w(cid:107)2. At time t, FW\nscalarizes the outcome in eqn (6) with the gradient\n\n\u2207g( \u00afV1:t\u22121) =\n\n(cid:2)(L1, . . . , LK)(cid:62) \u2212 L0( \u00afV1:t\u22121 \u2212 \u03a0U ( \u00afV1:t\u22121))(cid:3) .\n1, L1, . . . , LK = 0, U = {\u0001}. The s-component of \u2207g( \u00afV1:t\u22121) is (\u0001s \u2212(cid:80)t\u22121\n\nTo gain intuitions, consider State Space Exploration with target frequency \u0001, where L0 =\nq=1 1sq=s/(t \u2212 1))/K,\nwhich encourages visiting state s when its empirical frequency is below the target \u0001s. Sim-\nilarly, for Multi-Objective Optimization with KPI target \u03c1, the k-component of \u2207g( \u00afV1:t\u22121) is\n(Lk + L0 max{\u03c1k \u2212 \u00afV1:t\u22121,k, 0})/K. The agent is motivated to focus on the kth objective when\n\u00afV1:t\u22121,k \u2264 \u03c1k.\nExtended Value Iteration (EVI) [28] solves for an optimistic stationary policy for an SO-OMDP\nproblem, when v, p are not known. We extract EVI from [28] in Appendix B.3. Ideally, at the start\nof each episode m, the agent wishes to compute the optimal policy under the scalarized reward\n\u2207g( \u00afV1:\u03c4 (m)\u22121)(cid:62)v and transition kernel p. 
Since v, p are uncertain, the agent uses EVI [28] to\ncompute the stationary policy \u02dc\u03c0m in (7), which is optimal for the optimistic choices of \u02dcvm \u2208 H^v_m\nand \u02dcpm \u2208 H^p_m. By optimistic choices \u02dcvm, \u02dcpm, we mean that the resulting single objective MDP\nwith scalar rewards \u02dcrm = {\u02dcrm(s, a)}s,a, \u02dcrm(s, a) = \u2207g( \u00afV1:\u03c4 (m)\u22121)\u22a4\u02dcvm(s, a) and transition kernel\n\u02dcpm has the highest long term average reward, among all \u00afvm \u2208 H^v_m, \u00afpm \u2208 H^p_m. The last argument\n1/\u221a\u03c4 (m) of EVI is an additive error term allowed for EVI. By [28], EVI converges to a stationary\npolicy \u02dc\u03c0m in finite time when H^p_m contains the transition kernel for a communicating MDP.\n\nAlgorithm 1 TFW-UCRL2 on g\n1: Inputs: Parameter \u03b4 \u2208 (0, 1), gradient threshold Q \u2265 0 (default Q = \u00afL/\u221aK), initial state s1.\n2: Initialize t = 1.\n3: for Episode m = 1, 2, . . . do\n4:   Set \u03c4 (m) = t, and initialize N^+_m(s, a) according to Eq (3) for each s \u2208 S, a \u2208 As.\n5:   Compute the confidence regions H^v_m, H^p_m respectively for v, p, according to Eqs (4, 5).\n6:   Compute the optimistic reward \u02dcrm = {\u02dcrm(s, a)}s\u2208S,a\u2208As:\n       \u02dcrm(s, a) = max_{\u00afv(s,a)\u2208H^v_m(s,a)} \u2207g( \u00afV1:\u03c4 (m)\u22121)\u22a4\u00afv(s, a).    (6)\n7:   Compute a (1/\u221a\u03c4 (m))-optimal optimistic policy \u02dc\u03c0m:\n       \u02dc\u03c0m \u2190 EVI(\u02dcrm, H^p_m; 1/\u221a\u03c4 (m)).    (7)\n8:   Initialize \u03bdm(s, a) = 0 for each s, a, \u03b8ref = \u03b8\u03c4 (m) = \u2207g( \u00afV1:(\u03c4 (m)\u22121)), \u03a8 = 0.\n9:   while \u03a8 \u2264 Q and \u03bdm(st, \u02dc\u03c0m(st)) < N^+_m(st, \u02dc\u03c0m(st)) do\n10:    Choose action at = \u02dc\u03c0m(st).\n11:    Observe the outcomes Vt(st, at) and the next state st+1.\n12:    Compute gradient \u03b8t+1 = \u2207g( \u00afV1:t).    \u25b7 Frank-Wolfe\n13:    Update \u03a8 \u2190 \u03a8 + \u2016\u03b8t+1 \u2212 \u03b8ref\u20162.\n14:    Update \u03bdm(st, at) \u2190 \u03bdm(st, at) + 1.\n15:    Update t \u2190 t + 1.\n16:  end while\n17: end for\n\nThe Gradient Threshold Procedure (GTP) maintains FW\u2019s balancing effect on the vectorial\noutcomes, while overcoming the challenges in avoiding sub-optimal actions. GTP maintains a distance\nmeasure \u03a8 on the gradients generated by FW during each episode, and starts the next episode if the\nmeasure \u03a8 exceeds a threshold Q. A small Q makes the agent alternate among different stationary\npolicies frequently and balances the outcomes, while a large Q facilitates learning and avoids visiting\nsub-optimal states. A properly tuned Q paves the way to solving the CO-OMDP problem.\nA direct combination of FW and EVI corresponds to TFW-UCRL2 with Q = 0, which silences\nGTP and incurs Reg(T ) = \u2126(1) on the instance in Fig 1b. Assume that the start state is s0, that v, p are\ncompletely known, and that ties are broken consistently. The agent would go to s2, take ll once,\nthen go to s1, take rl once, then back to s2 and take ll once, and so on (the same dynamics as in\nChallenges). Indeed, under the pure effect of FW, the agent is obsessed with balancing the outcomes.\nOnce \u00afV1:t\u22121,1 > \u00afV1:t\u22121,2, the scalarized reward for (s2, ll) is higher than that for (s1, rl), and she\ntravels to s2. Similarly, once \u00afV1:t\u22121,1 \u2264 \u00afV1:t\u22121,2, she travels to s1. In this process, she is oblivious to\nthe fact that constantly alternating between ll, rl penalizes her objective by constantly visiting s0.\nIn contrast, applying TFW-UCRL2 with 0 < Q < \u221e leads us to near-optimality. For example, with\nQ = \u00afL/\u221aK, the agent follows this interesting trajectory: Suppose the agent is at s0 at time t. 
If\n\u00afV1:t\u22121,1 > \u00afV1:t\u22121,2, then she would travel to s2, take ll for \u0398(\u221a(Qt)) times, then head back to s0.\nOtherwise, she would travel to s1, take rl for \u0398(\u221a(Qt)) times, then head back to s0. Altogether, for\nevery t we have \u00afV1:t\u22121,1, \u00afV1:t\u22121,2 = 0.5 \u00b1 O(\u221a(Q/t)), and the agent only visits s0 for O(\u221a(t/Q))\ntimes, leading to the anytime regret bound Reg(T ) = O((\u221aQ + \u221a(1/Q))/\u221aT ).\nFinally, in another extreme case of Q = \u221e, in fact we have Reg(T ) = \u2126(1/(SA log T )). Indeed,\nthe condition \u03a8 \u2264 Q is always satisfied. By applying [28], the agent alternates among ll, rl only\nO(SA log T ) times in T time steps. This leads to an imbalance in the outcomes, since the agent\ncould stay at a loop for \u2126(T /(SA log T )) time, and results in Reg(T ) = \u2126(1/(SA log T )).\nMain Results. We establish regret bounds for TFW-UCRL2. Denote S := |S|, A := (1/S) \u2211_{s\u2208S} |As|,\nso SA is the number of state-action pairs. Denote \u0393 := max_{s\u2208S,a\u2208As} \u2016p(\u00b7|s, a)\u2016_0, which is the\nmaximum number of states to which a state-action pair can transit. We employ the \u02dcO(\u00b7) notation,\nwhich hides additive terms that scale with log(T /\u03b4)/T as well as multiplicative log(T /\u03b4) factors.\nTheorem 3.1. Consider TFW-UCRL2 with gradient threshold Q > 0, applied on a communicating\nCO-OMDP instance M with diameter D. With probability 1 \u2212 O(\u03b4), we have anytime regret bound\n\nReg(T ) = \u02dcO( [ \u221a(L0Q) + \u221a(L0) \u00afLD/\u221a(KQ) ] K^{1/4} / \u221aT ) + \u02dcO( \u00afL(D + 1)\u221a(\u0393SA) / \u221aT ).\n\nIn particular, setting Q = \u00afL/\u221aK gives Reg(T ) = \u02dcO( \u00afL(D + 1)\u221a(\u0393SA)/\u221aT ).\nLet\u2019s focus on the first \u02dcO(\u00b7) term in the bound. The \u02dcO(\u221aQ) term represents the regret due to\nthe delay in gradient updates by GTP. The \u02dcO(1/\u221aQ) term represents the regret due to (a) the\ninterference of GTP with the learning of v, p, (b) the switches among stationary policies, which\ncould require visiting sub-optimal states. The second \u02dcO(\u00b7) term is the regret due to the simultaneous\nexploration-exploitation by EVI.\nBy specializing L0 = 0, L1, . . . , LK = 1, TFW-UCRL2 incurs Reg(T ) = \u02dcO(D\u221a(\u0393SA)/\u221aT )\non SO-OMDP, which essentially matches [28]1. By specializing S = {s}, TFW-UCRL2 incurs\nReg(T ) = \u02dcO( \u00afL\u221aA/\u221aT ), which matches [2] on BwR on g. While our regret bounds match [28, 2]\nin those special cases, the design and analysis of TFW-UCRL2 require novel ideas that depart from\n[28, 2]. We design the novel GTP for handling state transitions. In the upcoming analysis, we show\nthat GTP is streamlined so that it achieves our regret bounds, without excessively interfering with the\nbalancing by FW and the learning by EVI. Finally, note that TFW-UCRL2 is a non-stationary policy\nthat diversifies across different stationary policies across time. Interestingly, a non-stationary policy is\nnecessary for achieving near-optimality, even when the model parameters are unchanging:\nClaim 3.2. Every stationary policy incurs an \u2126(1) anytime regret on the instance in Fig 1b.\n\nThe Claim, proved in Appendix B.4, illustrates a profound difference between communicating\nCO-OMDPs and unichain CO-OMDPs.\n\n1 Jaksch et al. [28] achieve the regret bound \u02dcO(DS\u221aA/\u221aT ). The factor of S is improved to \u221a(\u0393S), by\napplying an empirical Bernstein inequality [7] instead of the Hoeffding inequality, as used in [23].\n\nMax E[g( \u00afV1:T )] vs max g(E[ \u00afV1:T ]). Our objective is to maximize g( \u00afV1:T ) (which also leads to maximizing\nE[g( \u00afV1:T )]), which is crucially different from maximizing g(E[ \u00afV1:T ]). Now, for any policy, it holds that\nE[g( \u00afV1:T )] \u2264 g(E[ \u00afV1:T ]) \u2264 opt(PM) + O( \u00afLD/T ). The first inequality is Jensen\u2019s inequality applied to the\nconcave function g. The second inequality (formally proved in [18])\nis demonstrated by showing that, for any policy, the empirical frequency of visiting each state-action\npair is \u201cnearly\u201d a feasible solution to (PM), with the O( \u00afLD/T ) term capturing the error due to the\nnear feasibility. Under TFW-UCRL2, we know that E[g( \u00afV1:T )] tends to opt(PM) as T grows. Hence\nwe also have g(E[ \u00afV1:T ]) tending to opt(PM) as T grows.\nNevertheless, the converse is not true. Consider the instance in Fig 1b again, where the starting state\nis s0. Consider the following policy: At the start, the agent transits to either s1 or s2 with probability\n1/2. After that, the agent loops at that state indefinitely. It is clear that Pr[ \u00afV1:T = (1 \u2212 1/T, 0)\u22a4] =\nPr[ \u00afV1:T = (0, 1 \u2212 1/T )\u22a4] = 1/2. On the one hand, we have g(E[ \u00afV1:T ]) = \u22121/(8T^2), which tends\nto opt(PM) = 0 as T \u2192 \u221e. On the other hand, we have E[g( \u00afV1:T )] = \u22121/8 + O(1/T ), which\ndoes not tend to opt(PM) = 0 as T \u2192 \u221e. 
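The two quantities in this example can be computed exactly. The sketch below is illustrative only: the function name compare is hypothetical, and g is taken in the normalization of Eq (1) with L0 = 1, K = 2, L1 = L2 = 0 and U = {(0.5, 0.5)}, which is an assumption chosen because it reproduces the constants quoted in the text.

```python
from fractions import Fraction


def g(w):
    # Eq (1) with L0 = 1, K = 2, L1 = L2 = 0 and U = {(1/2, 1/2)} (assumed):
    # g(w) = -(1 / (2 * K)) * sum_k (w_k - 1/2)^2 = -sum_k (w_k - 1/2)^2 / 4.
    half = Fraction(1, 2)
    return -sum((x - half) ** 2 for x in w) / 4


def compare(T):
    # The randomized policy yields (1 - 1/T, 0) or (0, 1 - 1/T), each w.p. 1/2.
    a = 1 - Fraction(1, T)
    outcomes = [(a, Fraction(0)), (Fraction(0), a)]
    mean = tuple(sum(o[k] for o in outcomes) / 2 for k in range(2))
    g_of_mean = g(mean)                           # g(E[V])
    mean_of_g = sum(g(o) for o in outcomes) / 2   # E[g(V)]
    return g_of_mean, mean_of_g


if __name__ == '__main__':
    for T in (10, 100, 1000):
        print(T, compare(T))
```

For T = 10 this gives g(E[V]) = -1/800 = -1/(8 T^2), while E[g(V)] = -41/400, which stays near -1/8 as T grows.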
Altogether, solving max g(E[ \u00afV1:T ]) to near-optimality\ndoes not lead to the near-optimality for max E[g( \u00afV1:T )].\nTo this end, it is worth mentioning that the related works in the discounted settings [24, 6, 12, 43,\nt=1 \u03b3tVt(st, at)]), where \u00afg is a certain non-linear\nfunction and \u03b3 \u2208 (0, 1) is the discounted factor. We envision that our technique could be useful for\n\n1, 25, 42, 29, 30, 33] focus on maximizing \u00afg(E[(cid:80)\u221e\nmaximizing E[\u00afg((cid:80)\u221e\n\nt=1 \u03b3tVt(st, at))] instead of maximizing \u00afg(E[(cid:80)\u221e\n\nt=1 \u03b3tVt(st, at)]).\n\nGeneralizations. While Theorem 3.1 concerns the specialized aggregate function (1), Cheung [18]\nrecently generalizes the algorithmic framework to any Lipschitz continuous and smooth function.\nBy adapting to the online mirror descent algorithm [34], Cheung [18] proposes another algorithm\nthat results in O(1/T 3) regret (We hide the dependence on M) for Lipschitz continuous concave\naggregate functions that are not necessarily smooth.\n\n4 Analysis of TFW-UCRL2\nIn this Section, we prove Theorem 3.1. To start, we consider events E v,E p and Lemma 4.1, which is\nproved in Appendix C.1. The shorthand \u2200m, s, a means \u2018for all m \u2208 N, s \u2208 S, a \u2208 As\u2019.\nm(s, a) \u2200m, s, a} .\n\nE p := {p(\u00b7|s, a) \u2208 H p\n\nE v := {v(s, a) \u2208 H v\n\nm(s, a) \u2200m, s, a} ,\n\nLemma 4.1. It holds that P[E v] \u2265 1 \u2212 \u03b4/2, P[E p] \u2265 1 \u2212 \u03b4/2.\n\n41]. De\ufb01ne v\u2217 :=(cid:80)\n\nWe decompose Reg(T ) with the analytical tools on FW [21, 16], which is also adapted in [2, 14, 25,\n\ns,a v(s, a)x\u2217(s, a), where x\u2217 is an optimal solution of (PM). 
We have\n\n(cid:107) \u00afV1:t \u2212 \u00afV1:t\u22121(cid:107)2\n\n2\n\n(8)\n\ng( \u00afV1:t) \u2265 g( \u00afV1:t\u22121) + \u2207g( \u00afV1:t\u22121)(cid:62)[ \u00afV1:t \u2212 \u00afV1:t\u22121] \u2212 L0\nK\n\u2207g( \u00afV1:t\u22121)(cid:62)[Vt(st, at) \u2212 \u00afV1:t\u22121] \u2212 L0\n\u2207g( \u00afV1:t\u22121)(cid:62)[v\u2217 \u2212 \u00afV1:t\u22121] +\n1\nt\n\n(cid:2)opt(PM) \u2212 g( \u00afV1:t\u22121)(cid:3) +\n\n=g( \u00afV1:t\u22121) +\n\u2265g( \u00afV1:t\u22121) +\n\u2265g( \u00afV1:t\u22121) +\n\n1\nt\n1\nt\n1\nt\n\nKt2(cid:107)Vt(st, at) \u2212 \u00afV1:t\u22121(cid:107)2\n\n2\n\n\u2207g( \u00afV1:t\u22121)(cid:62)[Vt(st, at) \u2212 v\u2217] \u2212 L0\n1\nt2\nt\n\u2207g( \u00afV1:t\u22121)(cid:62)[Vt(st, at) \u2212 v\u2217] \u2212 L0\nt2 .\n\n(9)\nStep (8) is by the property that g is (2L0/K)-smooth w.r.t. (cid:107)\u00b7(cid:107)2 on the domain [0, 1]K (see Appendix\nB.1). Rearranging (9) gives\n\nt \u00b7 Reg(t) \u2264 (t \u2212 1) \u00b7 Reg(t \u2212 1) +\n\n+ \u2207g( \u00afV1:t\u22121)(cid:62)[v\u2217 \u2212 Vt(st, at)].\n\nApply (10) recursively for t = T, . . . , 1, we obtain (recall that \u03b8t = \u2207g( \u00afV1:t\u22121)):\n\nReg(T ) \u2264 2L0 log T\n\nT\n\n+\n\n1\nT\n\nt [v\u2217 \u2212 Vt(st, at)].\n\u03b8(cid:62)\n\nL0\nt\n\nT(cid:88)\n\nt=1\n\n(10)\n\n(11)\n\nThe main analysis is now on the second term in (11), which requires novel technical analysis regarding\nthe dynamics of the gradient threshold procedure. We start with the following bound.\n\n7\n\n\fProposition 4.2. Consider an execution of TFW-UCRL2 on a communicating instance with diameter\nD. For each T \u2208 N, suppose that there is a deterministic constant M (T ) s.t. 
$\\Pr[m(T) \\leq M(T)] = 1$.\nConditioned on events $\\mathcal{E}^v, \\mathcal{E}^p$, with probability at least $1 - O(\\delta)$ we have\n\n$$\\sum_{t=1}^{T} \\theta_t^\\top [v^* - V_t(s_t, a_t)] = \\tilde{O}\\left((Q\\sqrt{K} + \\bar{L} D) M(T)\\right) + \\tilde{O}\\left(\\bar{L}(D + 1)\\sqrt{\\Gamma S A T}\\right).$$\n\nProposition 4.2 bounds two sources of error: (i) the error due to GTP, (ii) the estimation errors\nassociated with $H^v_m$, $H^p_m$ and EVI. Error (ii) can be upper bounded by the machinery in [28]. Error (i)\nconcerns the following discrepancy. For each time $t$ in episode $m$, the action $a_t = \\tilde{\\pi}_m(s_t)$ is chosen\nbased on policy $\\tilde{\\pi}_m$, which involves the scalarization by $\\theta_{\\tau(m)}$. However, ideally the action at time $t$\nshould balance the current vectorial outcomes by the scalarization with $\\theta_t$. The Proposition bounds\nerror (i) by charging the discrepancy to the threshold $Q$ and the upper bound $M(T)$. To complete the\nproof of bounding Reg(T), we establish a bound $M(T)$ small enough to achieve Theorem 3.1.\n\nLemma 4.3. Consider an execution of TFW-UCRL2 with gradient threshold $Q > 0$. With certainty,\nfor every $T \\in \\mathbb{N}$ we have $m(T) \\leq M(T) = \\tilde{O}(\\sqrt{L_0 T / (\\sqrt{K} Q)})$.\n\nThe Lemma bounds the error of GTP in balancing the outcomes. Under FW, the gradients change at\na rate $O(1/t)$, which is slow enough for the agent to judiciously delay the gradient updates without\nsacrificing their balancing effect too much. This makes it possible to avoid visiting sub-optimal states\ntoo frequently.\n\nSketch Proof of Lemma 4.3. First, observe that {1, . . . 
, m(T )} is the union of\n\n$$\\mathcal{M}_\\Psi(T) := \\{m \\in \\mathbb{N} : \\tau(m) \\leq T, \\text{ episode } m + 1 \\text{ is started due to } \\Psi \\geq Q\\},$$\n$$\\mathcal{M}_\\nu(T) := \\{m \\in \\mathbb{N} : \\tau(m) \\leq T, \\text{ episode } m + 1 \\text{ is started due to } \\nu_m(s_t, \\tilde{\\pi}_m(s_t)) \\geq N^+_m(s_t, \\tilde{\\pi}_m(s_t)) \\text{ for some } t \\geq \\tau(m)\\}.$$\n\nTo prove the Lemma, it suffices to show that:\n\n\\begin{align}\n|\\mathcal{M}_\\Psi(T)| &\\leq M_\\Psi(T) := 1 + \\frac{\\sqrt{K} Q}{2 L_0} + 4\\sqrt{\\frac{2 L_0 T}{\\sqrt{K} Q}}, \\tag{12} \\\\\n|\\mathcal{M}_\\nu(T)| &\\leq M_\\nu(T) := S A (1 + \\log_2 T). \\tag{13}\n\\end{align}\n\nThe bound (13) follows from [28]. Thus, we focus on showing bound (12). We write $\\mathcal{M}_\\Psi(T) = \\{m_1, m_2, \\ldots, m_{n_\\Psi}\\}$, where $m_1 < m_2 < \\ldots < m_{n_\\Psi}$. We also define $m_0 = 0$. We focus on an\nepisode index $m_j$ with $j \\geq 1$, and consider for each $t \\in \\{\\tau(m_j) + 1, \\ldots, \\tau(m_j + 1)\\}$ the difference\n$\\|\\theta_t - \\theta_{\\tau(m_j)}\\|_2$. In the following, we argue that the gradients under FW change slowly:\n\n\\begin{align*}\n\\|\\theta_t - \\theta_{\\tau(m_j)}\\|_2 &= \\|\\nabla g(\\bar{V}_{1:t-1}) - \\nabla g(\\bar{V}_{1:\\tau(m_j)-1})\\|_2 \\\\\n&\\leq \\frac{L_0}{K} \\left\\| \\frac{1}{t - 1} \\sum_{q=1}^{t-1} V_q(s_q, a_q) - \\frac{1}{\\tau(m_j) - 1} \\sum_{q=1}^{\\tau(m_j)-1} V_q(s_q, a_q) \\right\\|_2 \\\\\n&= \\frac{L_0}{K} \\cdot \\frac{t - \\tau(m_j)}{t - 1} \\cdot \\left\\| \\frac{1}{t - \\tau(m_j)} \\sum_{q=\\tau(m_j)}^{t-1} V_q(s_q, a_q) - \\frac{1}{\\tau(m_j) - 1} \\sum_{q=1}^{\\tau(m_j)-1} V_q(s_q, a_q) \\right\\|_2 \\\\\n&\\leq \\frac{2 L_0}{\\sqrt{K}} \\cdot \\frac{t - \\tau(m_j)}{t - 1} \\leq \\frac{2 L_0}{\\sqrt{K}} \\cdot \\frac{t - \\tau(m_j)}{\\tau(m_j)}.\n\\end{align*}\n\nSince $m_j \\in \\mathcal{M}_\\Psi(T)$, we know that $\\sum_{t=\\tau(m_j)}^{\\tau(m_j + 1)} \\|\\theta_t - \\theta_{\\tau(m_j)}\\|_2 > Q$, which means\n\n$$\\frac{(\\tau(m_j + 1) - \\tau(m_{j-1} + 1))^2}{\\tau(m_{j-1} + 1)} \\geq \\frac{(\\tau(m_j + 1) - \\tau(m_j))^2}{\\tau(m_j)} \\geq \\sum_{t=\\tau(m_j)}^{\\tau(m_j + 1)} \\frac{t - \\tau(m_j)}{\\tau(m_j)} > \\frac{\\sqrt{K} Q}{2 L_0}. \\tag{14}$$\n\nInequality (14) says that, since gradients change slowly, the time indexes $\\{\\tau(m_j + 1)\\}_{j=1}^{n_\\Psi}$ have to be\nfar apart. Thus, $n_\\Psi$ can be bounded from above. Indeed, by some technical arguments (see Appendix\nC.2), inequality (14) turns out to imply $\\tau(m_{\\lceil Q' \\rceil + j} + 1) \\geq Q'(j - 1)^2/16$, where $Q' = \\sqrt{K} Q/(2 L_0)$.\nWith $j = n_\\Psi - 1$, we get $Q'(n_\\Psi - 2)^2/16 \\leq \\tau(m_{\\lceil Q' \\rceil + n_\\Psi - 1} + 1) \\leq T$, leading to (12).\nCombining the bound (11), Proposition 4.2 and Lemma 4.3, we have proved Theorem 3.1.\n\n5 Numerical Experiments\n\nWe empirically evaluate TFW-UCRL2 on State Space Exploration on 3 instances: Small, Medium,\nLarge. These instances are detailed in Appendix A.1. In Fig 2a, TFW-UCRL2 is simulated 25 times on each\ninstance and for each $Q$. In Figs 2b, 2c, TFW-UCRL2 is simulated 25 times on each instance with\n$Q = \\bar{L}/\\sqrt{K}$. Each curve plots the averages across the 25 trials, and each error bar\nquantifies the \u00b1 standard deviation error region. Fig 2a depicts $\\mathrm{Reg}(10^5)$ under different values of $Q$. While\nextremely small or large $Q$ leads to a large regret, TFW-UCRL2 seems robust in the middle range of\n$Q$. Our default $Q = \\bar{L}/\\sqrt{K}$ (green dot) is motivated by our analysis, and it does not optimize the\nempirical performance. Tuning $Q$ online is an interesting research direction.\n\n(a) $\\mathrm{Reg}(10^5)$ under different $Q$\n\n(b) Reg(T) as T grows\n\n(c) Simultaneous Convergence\n\nFigure 2: Simulation Results on State Space Exploration\n\nFig 2b demonstrates the trend of Reg(T) as T grows, in log-log scales. 
The performance of TFW-UCRL2 is in contrast with the random policy (in yellow), which samples an action uniformly at\nrandom at every state. The Reg(T) under TFW-UCRL2 converges to 0 as T grows, while the Reg(T)\nunder the random policy is constant. The slight wiggling in the plots for TFW-UCRL2 is due to GTP,\nwhich could deteriorate the objective in the short term but still leads to near-optimality eventually.\nFig. 2c highlights the simultaneous convergence of each objective to its target on the Large instance.\nThe instance involves a star graph with a center state and 12 branch states. The objectives are to visit\nthe center state with frequency 0 (Obj 1) and to visit each branch state with frequency $1/12 \\approx 0.083$\n(Objs 2, . . . , 13). These target frequencies (in dashed black) are not realizable, and we plot (in dotted\ncyan) the frequencies indicated by $v^* = \\sum_{s,a} v(s, a) x^*(s, a)$, where $x^*$ is an optimal solution of\n(PM). Along with the complete plot in Fig 4 in Appendix A.2, we see that the outputs $\\{\\bar{V}_{1:T,k}\\}_{k=1}^{13}$\nby TFW-UCRL2 (in solid blue) simultaneously converge to all the 13 target frequencies.\n\nReferences\n\n[1] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML\u201917, pages 22\u201331. JMLR.org, 2017.\n\n[2] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In ACM Conference on Economics and Computation, 2014.\n\n[3] S. Agrawal and N. R. Devanur. Linear contextual bandits with knapsacks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3450\u20133458, 2016.\n\n[4] S. Agrawal, N. R. Devanur, and L. Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. 
In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 4\u201318, 2016.\n\n[5] S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1184\u20131194. Curran Associates, Inc., 2017.\n\n[6] E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.\n\n[7] J. Audibert, R. Munos, and C. Szepesv\u00e1ri. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci., 410(19):1876\u20131902, 2009.\n\n[8] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49\u201356. MIT Press, 2006.\n\n[9] Y. Azar, U. Feige, M. Feldman, and M. Tennenholtz. Sequential decision making with vector outcomes. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS \u201914, pages 195\u2013206, New York, NY, USA, 2014. ACM.\n\n[10] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS \u201913, pages 207\u2013216. IEEE Computer Society, 2013.\n\n[11] A. Badanidiyuru, J. Langford, and A. Slivkins. Resourceful contextual bandits. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, pages 1109\u20131134, 2014.\n\n[12] L. Barrett and S. Narayanan. 
Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning, ICML \u201908, pages 41\u201347, New York, NY, USA, 2008. ACM.\n\n[13] P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pages 35\u201342, 2009.\n\n[14] Q. Berthet and V. Perchet. Fast rates for bandit optimization with upper-confidence frank-wolfe. In Advances in Neural Information Processing Systems 30, pages 2225\u20132234. 2017.\n\n[15] D. P. Bertsekas. Dynamic programming and optimal control, 3rd Edition. Athena Scientific, 2005.\n\n[16] S. Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231\u2013357, Nov. 2015.\n\n[17] R. Busa-Fekete, B. Sz\u00f6r\u00e9nyi, P. Weng, and S. Mannor. Multi-objective bandits: Optimizing the generalized Gini index. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 625\u2013634, International Convention Centre, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[18] W. C. Cheung. Exploration-exploitation trade-off in reinforcement learning on online markov decision processes with global concave rewards. CoRR, abs/1905.06466, 2019.\n\n[19] E. Even-Dar, R. Kleinberg, S. Mannor, and Y. Mansour. Online learning with global cost functions. In 22nd Annual Conference on Learning Theory, COLT, 2009.\n\n[20] K. J. Ferreira, D. Simchi-Levi, and H. Wang. Online network revenue management using thompson sampling. Operations Research, 66(6):1586\u20131602, 2018.\n\n[21] M. Frank and P. Wolfe. An algorithm for quadratic programming. 
Naval Research Logistics Quarterly, 3(1-2):95\u2013110, March - June 1956.\n\n[22] R. Fruit, M. Pirotta, and A. Lazaric. Near optimal exploration-exploitation in non-communicating markov decision processes. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2998\u20133008. Curran Associates, Inc., 2018.\n\n[23] R. Fruit, M. Pirotta, A. Lazaric, and R. Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1578\u20131586, Stockholmsm\u00e4ssan, Stockholm Sweden, 10\u201315 Jul 2018. PMLR.\n\n[24] Z. G\u00e1bor, Z. Kalm\u00e1r, and C. Szepesv\u00e1ri. Multi-criteria reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML \u201998, pages 197\u2013205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.\n\n[25] E. Hazan, S. M. Kakade, K. Singh, and A. V. Soest. Provably efficient maximum entropy exploration. CoRR, 2018.\n\n[26] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13\u201330, 1963.\n\n[27] N. Immorlica, K. A. Sankararaman, R. E. Schapire, and A. Slivkins. Adversarial bandits with knapsacks. CoRR, 2018.\n\n[28] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563\u20131600, Aug. 2010.\n\n[29] H. Le, C. Voloshin, and Y. Yue. Batch policy learning under constraints. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3703\u20133712, Long Beach, California, USA, 09\u201315 Jun 2019. 
PMLR.\n\n[30] L. Lee, B. Eysenbach, E. Parisotto, E. P. Xing, S. Levine, and R. Salakhutdinov. Efficient exploration via state marginal matching. CoRR, abs/1906.05274, 2019.\n\n[31] C. Liu, X. Xu, and D. Hu. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385\u2013398, March 2015.\n\n[32] S. Mannor and N. Shimkin. A geometric approach to multi-criterion reinforcement learning. J. Mach. Learn. Res., 5:325\u2013360, Dec. 2004.\n\n[33] S. Miryoosefi, K. Brantley, H. D. III, M. Dud\u00edk, and R. E. Schapire. Reinforcement learning with convex constraints. To appear in NeurIPS 2019, 2019.\n\n[34] A. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley Interscience Series in discrete mathematics, 1983.\n\n[35] J. R. Norris. Markov chains. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, 1998.\n\n[36] R. Ortner. Regret bounds for reinforcement learning via markov chain concentration. CoRR, 2018.\n\n[37] Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems 30, pages 1333\u20131342. 2017.\n\n[38] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.\n\n[39] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. J. Artif. Int. Res., 48(1):67\u2013113, Oct. 2013.\n\n[40] A. Rosenberg and Y. Mansour. Online convex optimization in adversarial Markov decision processes. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5478\u20135486, Long Beach, California, USA, 09\u201315 Jun 2019. 
PMLR.\n\n[41] J. Tarbouriech and A. Lazaric. Active exploration in markov decision processes. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 974\u2013982. PMLR, 16\u201318 Apr 2019.\n\n[42] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019.\n\n[43] K. Van Moffaert and A. Now\u00e9. Multi-objective reinforcement learning using sets of pareto dominating policies. J. Mach. Learn. Res., 15(1):3483\u20133512, Jan. 2014.", "award": [], "sourceid": 379, "authors": [{"given_name": "Wang Chi", "family_name": "Cheung", "institution": "Department of Industrial Systems Engineering and Management, National University of Singapore"}]}