{"title": "Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 2492, "page_last": 2501, "abstract": "Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems. In addition, we can take advantage of special cases that arise due to options-based policies to further improve the performance of importance sampling. We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance Sampling that can provide significantly more accurate estimates for a broad class of domains.", "full_text": "Using Options and Covariance Testing for Long\n\nHorizon Off-Policy Policy Evaluation\n\nZhaohan Daniel Guo\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nzguo@cs.cmu.edu\n\nPhilip S. 
Thomas\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA 01003\n\npthomas@cs.umass.edu\n\nEmma Brunskill\nStanford University\nStanford, CA 94305\n\nebrun@cs.stanford.edu\n\nAbstract\n\nEvaluating a policy by deploying it in the real world can be risky and costly.\nOff-policy policy evaluation (OPE) algorithms use historical data collected from\nrunning a previous policy to evaluate a new policy, which provides a means for\nevaluating a policy without requiring it to ever be deployed. Importance sampling\nis a popular OPE method because it is robust to partial observability and works\nwith continuous states and actions. However, the amount of historical data required\nby importance sampling can scale exponentially with the horizon of the problem:\nthe number of sequential decisions that are made. We propose using policies over\ntemporally extended actions, called options, and show that combining these policies\nwith importance sampling can signi\ufb01cantly improve performance for long-horizon\nproblems. In addition, we can take advantage of special cases that arise due to\noptions-based policies to further improve the performance of importance sampling.\nWe further generalize these special cases to a general covariance testing rule that\ncan be used to decide which weights to drop in an IS estimate, and derive a new IS\nalgorithm called Incremental Importance Sampling that can provide signi\ufb01cantly\nmore accurate estimates for a broad class of domains.\n\n1\n\nIntroduction\n\nOne important problem for many high-stakes sequential decision making under uncertainty domains,\nincluding robotics, health care, education, and dialogue systems, is estimating the performance of a\nnew policy without requiring it to be deployed. To address this, off-policy policy evaluation (OPE)\nalgorithms use historical data collected from executing one policy (called the behavior policy), to\npredict the performance of a new policy (called the evaluation policy). 
Importance sampling (IS) is one powerful approach that can be used to evaluate the potential performance of a new policy [12]. In contrast to model-based approaches to OPE [5], importance sampling provides an unbiased estimate of the performance of the evaluation policy. In particular, importance sampling is robust to partial observability, which is often prevalent in real-world domains. Unfortunately, importance sampling estimates of the performance of the evaluation policy can be inaccurate when the horizon of the problem is long: the variance of IS estimators can grow exponentially with the number of sequential decisions made in an episode. This is a serious limitation for applications that involve decisions made over tens or hundreds of steps, like dialogue systems where a conversation might require dozens of responses, or intelligent tutoring systems that make dozens of decisions about how to sequence the content shown to a student.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Due to the importance of OPE, there have been many recent efforts to improve the accuracy of importance sampling. For example, Dudík et al. [4] and Jiang and Li [7] proposed doubly robust importance sampling estimators that can greatly reduce the variance of predictions when an approximate model of the environment is available. Thomas and Brunskill [16] proposed an estimator that further integrates importance sampling and model-based approaches, and which can greatly reduce mean squared error. These approaches trade off between the bias and variance of model-based and importance sampling approaches, and result in strongly consistent estimators. Unfortunately, in long horizon settings, these approaches will either create estimates that suffer from high variance or rely exclusively on the provided approximate model, which can have high bias. 
Other recent efforts that estimate a value function using off-policy data rather than just the performance of a policy [6, 11, 19] also suffer from bias if the input state description is not Markovian (such as if the domain description induces partial observability).

To provide better off-policy estimates in long horizon domains, we propose leveraging temporal abstraction. In particular, we analyze using options-based policies (policies with temporally extended actions) [14] instead of policies over primitive actions. We prove that we can obtain an exponential reduction in the variance of the resulting estimates, and in some cases, cause the variance to be independent of the horizon. We also demonstrate this benefit with simple simulations. Crucially, our results can be equivalently viewed as showing that using options can drastically reduce the amount of historical data required to obtain an accurate estimate of a new evaluation policy's performance.

We also show that using options-based policies can result in special cases which can lead to significant reduction in estimation error through dropping importance sampling weights. Furthermore, we generalize the idea of dropping weights and derive a covariance test that can be used to automatically determine which weights to drop. We demonstrate the potential of this approach by constructing a new importance sampling algorithm called Incremental Importance Sampling (INCRIS) and show empirically that it can significantly reduce estimation error.

2 Background

We consider an agent interacting with a Markov decision process (MDP) for a finite sequence of time steps. At each time step the agent executes an action, after which the MDP transitions to a new state and returns a real-valued reward. 
Let s ∈ S be a discrete state, a ∈ A be a discrete action, and r be the reward, bounded in [0, Rmax]. The transition and reward dynamics are unknown and are denoted by the transition probability T(s′|s, a) and reward density R(r|s, a). A primitive policy maps histories to action probabilities, i.e., π(at|s1, a1, r1, . . . , st) is the probability of executing action at at time step t after encountering history s1, a1, r1, . . . , st. The return of a trajectory τ of H steps is simply the sum of the rewards, G(τ) = Σ_{t=1}^H rt. Note we consider the undiscounted setting where γ = 1. The value of policy π is the expected return when running that policy: V_π = E_π(G(τ)).

Temporal abstraction can reduce the computational complexity of planning and online learning [2, 9, 10, 14]. One popular form of temporal abstraction is to use sub-policies, in particular options [14]. Let Ω be the space of trajectories. An option o consists of π, a primitive policy (a policy over primitive actions); β : Ω → [0, 1], a termination condition, where β(τ) is the probability of stopping the option given the current partial trajectory τ ∈ Ω from when this option began; and I ⊂ S, an input set, where s ∈ I denotes the states where o is allowed to start. Primitive actions can be considered as a special case of options, where the options always terminate after a single step. μ(ot|s1, a1, . . . , st) denotes the probability of picking option ot given history (s1, a1, . . . , st) when the previous option has terminated, according to options-based policy μ. A high-level trajectory of length k is denoted by T = (s1, o1, v1, s2, o2, v2, . . . 
, sk, ok, vk) where vt is the sum of the rewards accumulated when executing option ot.

In this paper we will consider batch, offline, off-policy evaluation of policies for sequential decision making domains using both primitive action policies and options-based policies. We will now introduce the general OPE problem using primitive policies; in a later section we will combine this with options-based policies.

In OPE we assume access to historical data, D, generated by an MDP and a behavior policy, πb. D consists of n trajectories, {τ^(i)}_{i=1}^n. A trajectory has length H, and is denoted by τ^(i) = (s_1^(i), a_1^(i), r_1^(i), s_2^(i), a_2^(i), r_2^(i), . . . , s_H^(i), a_H^(i), r_H^(i)). In off-policy evaluation, the goal is to use the data D to estimate the value of an evaluation policy πe: V_πe. As D was generated from running the behavior policy πb, we cannot simply use the Monte Carlo estimate. An alternative is to use importance sampling to reweight the data in D to give greater weight to samples that are likely under πe and lesser weight to unlikely ones. We consider per-decision importance sampling (PDIS) [12], which gives the following estimate of the value of πe:

PDIS(D) = (1/n) Σ_{i=1}^n ( Σ_{t=1}^H ρ_t^(i) r_t^(i) ),   ρ_t^(i) = Π_{u=1}^t [ πe(a_u^(i)|s_u^(i)) / πb(a_u^(i)|s_u^(i)) ],   (1)

where ρ_t^(i) is the weight given to the rewards to correct for the difference in distribution. This estimator is an unbiased estimator of the value of πe:

E_πe(G(τ)) = E_πb(PDIS(τ)),   (2)

where E_π(. . . 
) is the expected value given that the trajectories τ are generated by π. For simplicity, hereafter we assume that primitive and options-based policies are a function only of the current state, but our results apply also when they are a function of the history. Note that importance sampling does not assume that the states in the trajectory are Markovian, and is thus robust to error in the state representation and, in general, robust to partial observability as well.

3 Importance Sampling and Long Horizons

We now show how the amount of data required for importance sampling to obtain a good off-policy estimate can scale exponentially with the problem horizon. Notice that in the standard importance sampling estimator, the weight is the product of the ratios of action probabilities. We now prove that this can cause the variance of the policy estimate to be exponential in H.¹

Theorem 1. The mean squared error of the PDIS estimator can be Ω(2^H). Proof. See appendix.

Equivalently, this means that achieving a desired mean squared error of ε can require a number of trajectories that scales exponentially with the horizon. A natural question is whether this issue also arises in weighted importance sampling [13], a popular (biased) approach to OPE that has lower variance. We show below that the long horizon problem still persists.

Theorem 2. It can take Ω(2^H) trajectories to shrink the MSE of weighted importance sampling (WIS) by a constant. Proof. See appendix.

4 Combining Options and Importance Sampling

We will show that one can leverage the advantage of options to mitigate the long horizon problem. 
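As a concrete illustration (not part of the paper's experiments), the per-decision importance sampling estimate of equation (1) can be computed from logged trajectories as follows; the trajectory encoding and the tabular policy representation are illustrative assumptions:

```python
# Sketch of the PDIS estimator of equation (1). Trajectories are lists of
# (state, action, reward) tuples; pi_e and pi_b map (state, action) pairs
# to action probabilities. This encoding is an illustrative assumption.

def pdis(trajectories, pi_e, pi_b):
    total = 0.0
    for tau in trajectories:
        rho = 1.0                              # running product of likelihood ratios
        for (s, a, r) in tau:
            rho *= pi_e[(s, a)] / pi_b[(s, a)]
            total += rho * r                   # reward at step t weighted by ratios up to t
    return total / len(trajectories)
```

When pi_e equals pi_b, every ratio is 1 and the estimate reduces to the Monte Carlo average of returns.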
If the behavior and evaluation policies are both options-based policies, then the PDIS estimator can be exponentially more data efficient compared to using primitive behavior and evaluation policies.

Due to the structure in options-based policies, we can decompose the difference between the behavior policy and the evaluation policy in a natural way. Let μb be the options-based behavior policy and μe be the options-based evaluation policy. First, we examine the probabilities over the options. The probabilities μb(ot|st) and μe(ot|st) can differ and contribute a ratio of probabilities as an importance sampling weight. Second, the underlying policy, π, for an option, ot, present in both μb and μe may differ, and this also contributes to the importance sampling weights. Finally, additional or missing options can be expressed by setting the probabilities over missing options to be zero for either μb or μe. Using this decomposition, we can easily apply PDIS to options-based policies.

Theorem 3. Let O be the set of options that have the same underlying policies between μb and μe. Let Ō be the set of options that have changed underlying policies. Let k^(i) be the length of the i-th high-level trajectory from data set D. Let j_t^(i) be the length of the sub-trajectory produced by option o_t^(i). 
The PDIS estimator applied to D is

PDIS(D) = (1/n) Σ_{i=1}^n ( Σ_{t=1}^{k^(i)} w_t^(i) y_t^(i) ),   y_t^(i) = { v_t^(i) if o_t^(i) ∈ O;  Σ_{b=1}^{j_t^(i)} ρ_{t,b}^(i) r_{t,b}^(i) if o_t^(i) ∈ Ō },   (3)

w_t^(i) = Π_{u=1}^t [ μe(o_u^(i)|s_u^(i)) / μb(o_u^(i)|s_u^(i)) ],   ρ_{t,b}^(i) = Π_{c=1}^b [ πe(a_{t,c}^(i)|s_{t,c}^(i), o_t^(i)) / πb(a_{t,c}^(i)|s_{t,c}^(i), o_t^(i)) ],   (4)

where r_{t,b}^(i) is the b-th reward in the sub-trajectory of option o_t^(i), and similarly for s and a.

Proof. This is a straightforward application of PDIS to the options-based policies using the decomposition mentioned.

Theorem 3 expresses the weights in two parts: one part comes from the probabilities over options, expressed as w_t^(i), and another part comes from the underlying primitive policies of options that have changed, with ρ_{t,b}^(i). We can immediately make some interesting observations below.

Corollary 1. If no underlying policies for options are changed between μb and μe, and all options have length at least J steps, then the worst case variance of PDIS is exponentially reduced from Ω(2^H) to Ω(2^(H/J)).

Corollary 1 follows from Theorem 3. Since no underlying policies are changed, the only importance sampling weights left are the w_t^(i). Thus we can focus our attention only on the high-level trajectory, which has length at most H/J. Effectively, the horizon has shrunk from H to H/J, which results in an exponential reduction of the worst case variance of PDIS.

¹These theorems can be seen as special case instantiations of Theorem 6 in [8] with simpler, direct proofs.

Corollary 2. 
If the probabilities over options are the same between μb and μe, and a subset of options Ō have changed their underlying policies, then the worst case variance of PDIS is reduced from Ω(2^H) to Ω(2^K), where K is an upper bound on the sum of the lengths of the changed options.

Corollary 2 follows from Theorem 3. The options whose underlying policies are the same between behavior and evaluation can effectively be ignored, and cut out of the trajectories in the data. This leaves only options whose underlying policies have changed, shrinking the horizon from H down to the length of the leftover options. For example, if only a single option of length 3 is changed, and the option appears once in a trajectory, then the horizon can be effectively reduced to just 3. This result can be very powerful, as the reduced variance becomes independent of the horizon H.

5 Experiment 1: Options-based Policies

This experiment illustrates how using options-based policies can significantly improve the accuracy of importance-sampling-based estimators for long horizon domains. Since importance sampling is particularly useful when a good model of the domain is unknown and/or the domain involves partial observability, we introduce a partially observable variant of the popular Taxi domain [3] called NoisyTaxi for our simulations (see Figure 1).

5.1 Partially Observable Taxi

Figure 1: Taxi Domain [3]. It is a 5×5 gridworld. There are 4 special locations: R, G, B, Y. A passenger starts randomly at one of the 4 locations, and its destination is randomly chosen from one of the 4 locations. The taxi starts randomly on any square. The taxi can move one step in any of the 4 cardinal directions N, S, E, W, as well as attempt to pick up or drop off the passenger. Each step has a reward of −1. 
An invalid pickup or dropoff has a −10 reward and a successful dropoff has a reward of 20.

In NoisyTaxi, the location of the taxi and the location of the passenger are partially observable. If the row location of the taxi is c, the agent observes c with probability 0.85, c + 1 with probability 0.075, and c − 1 with probability 0.075 (if adding or subtracting 1 would cause the location to be outside the grid, the resulting location is constrained to still lie in the grid). The column location of the taxi is observed with the same noisy distribution. Before the taxi successfully picks up the passenger, the observation of the location of the passenger has a probability of 0.15 of switching randomly to one of the four designated locations. After the passenger is picked up, the passenger is observed to be in the taxi with 100% probability (i.e., no noise while in the taxi).

5.2 Experimental Results

We consider ε-greedy option policies, where with probability 1 − ε the policy samples the optimal option, and with probability ε the policy samples a random option. Options in this case are n-step policies, where “optimal” options involve taking n steps of the optimal (primitive action) policy, and “random” options involve taking n random primitive actions.² Our behavior policies πb will use ε = 0.3 and our evaluation policies πe will use ε = 0.05. We investigate how the accuracy of estimating πe varies as a function both of the number of trajectories and of the length of the options, n = 1, 2, 3. Note n = 1 corresponds to having a primitive action policy.

Empirically, all behavior policies have essentially the same performance. Similarly, all evaluation policies have essentially the same performance. 
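The horizon reduction of Corollary 1 can also be seen in a small simulation that is not part of the paper's experiments: the variance of a product of mean-one likelihood ratios grows exponentially with the number of factors, and an options-based policy with unchanged sub-policies contributes only H/J factors. The two-point ratio distribution below is a synthetic stand-in:

```python
import numpy as np

# Variance of a product of i.i.d. mean-one likelihood ratios, as a function
# of the number of factors. The ratio distribution here is synthetic.
rng = np.random.default_rng(0)

def weight_variance(num_factors, trials=100_000):
    ratios = rng.choice([0.5, 1.5], size=(trials, num_factors))  # E[ratio] = 1
    return ratios.prod(axis=1).var()

H, J = 12, 3
var_primitive = weight_variance(H)       # one ratio per primitive step
var_options = weight_variance(H // J)    # one ratio per option choice
```

Analytically, the variance after m factors is 1.25^m − 1 for this ratio distribution, so shrinking the effective horizon from H to H/J shrinks the variance exponentially.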
We first collect data using the behavior policies, and then use PDIS to evaluate their respective evaluation policies. Figure 2 compares the MSE (log scale) of the PDIS estimators for the evaluation policies.

Figure 2: Comparing the MSE of PDIS between primitive and options-based behavior and evaluation policy pairs. Note the y-axis is a log scale. Our results show that PDIS estimates for the options-based evaluation policies are an order of magnitude better than PDIS for the primitive evaluation policy. Indeed, Corollary 1 shows that the n-step options policies effectively reduce the horizon by a factor of n over the primitive policy. As expected, the options-based policies that use 3-step options have the lowest MSE.

6 Going Further with Options

Often options are used to achieve a specific sub-task in a domain. For example, in a robot navigation task there may be an option to navigate to a special fixed location. However, one may realize that there is a faster way to navigate to that location, so one may change that option and try to evaluate the new policy to see whether it is actually better. In this case the old and new option are both always able to reach the special location; the only difference is that the new option could get there faster. In such a case we can further reduce the variance of PDIS. We now formally define this property.

Definition 1. Given behavior policy μb and evaluation policy μe, an option o is called stationary if the distribution of the states on which o terminates is always the same for μb and μe. The underlying policy for option o can differ between μb and μe; only the termination state distribution is important.

A stationary option may not always arise from solving a sub-task. It can also be the case that a stationary option is used as a way to perform a soft reset. 
For example, a robotic manipulation task may want to reset arm and hand joints to a default configuration in order to minimize sensor/motor error, before trying to grasp a new object.

Stationary options allow us to point to a step in a trajectory where we know the state distribution is fixed. Because the state distribution is fixed, we can partition the trajectory into two parts. The beginning of the second partition then has a state distribution that is independent of the actions chosen in the first partition. We can then independently apply PDIS to each partition, and sum up the estimates. This is powerful because it can halve the effective horizon of the problem.

²We have also tried using more standard options that navigate to a specific destination, and the experiment results closely mirror those shown here.

Theorem 4. Let μb be an options-based behavior policy. Let μe be an options-based evaluation policy. Let O be the set of options that μb and μe use. The underlying policies of the options in μe may be arbitrarily different from μb. Let o1 be a stationary option. We can decompose the expected value as follows. Let τ1 be the first part of a trajectory up until and including the first occurrence of o1. Let τ2 be the part of the trajectory after the first occurrence of o1 up to and including the first occurrence of o2. Then

E_μe(G(τ)) = E_μb(PDIS(τ)) = E_μb(PDIS(τ1)) + E_μb(PDIS(τ2)).   (5)

Proof. See appendix.

Note that there are no conditions on how the probabilities over options may differ, nor on how the underlying policies of the non-stationary options may differ. This means that, regardless of these differences, the trajectories can be partitioned and PDIS can be independently applied. Furthermore, Theorem 3 can still be applied to each of the independent applications of PDIS. 
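To make the partitioning of Theorem 4 concrete, here is a minimal sketch (not the authors' implementation) in which each trajectory is encoded as a list of per-step (likelihood ratio, reward) pairs and the split point where the stationary option terminates is assumed given:

```python
# Partitioned PDIS per equation (5): the importance weights restart at the
# step where the stationary option terminates. The (rho, reward) encoding
# and the fixed split index are illustrative assumptions.

def pdis_sum(steps):
    total, w = 0.0, 1.0
    for rho, r in steps:
        w *= rho               # cumulative likelihood ratio within this partition
        total += w * r
    return total

def partitioned_pdis(trajectories, split_index):
    estimates = [pdis_sum(tau[:split_index]) + pdis_sum(tau[split_index:])
                 for tau in trajectories]
    return sum(estimates) / len(estimates)
```

Because the weights in the second partition no longer include the first partition's ratios, each partition behaves like a problem with roughly half the horizon.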
Combining Theorem 4 and Theorem 3 can lead to more ways of designing a desired evaluation policy that will result in a low variance PDIS estimate.

7 Experiment 2: Stationary Options

We now demonstrate Theorem 4 empirically on NoisyTaxi. In NoisyTaxi, we know that a primitive ε-greedy policy will eventually pick up the passenger (though it may take a very long time depending on ε). Since the starting location of the passenger is uniformly random, the location of the taxi immediately after picking up the passenger is also uniformly random over the four pickup locations. This implies that, regardless of the ε value in an ε-greedy policy, we can view executing that ε-greedy policy until the passenger is picked up as a new “PickUp-ε” option that always terminates in the same state distribution.

Given this argument, we can use Theorem 4 to decompose any NoisyTaxi trajectory into the part before the passenger is picked up and the part after the passenger is picked up, estimate the expected reward for each, and then sum. As picking up the passenger is often the halfway point in a trajectory (depending on the locations of the passenger and the destination), we can perform importance sampling over two, approximately half length, trajectories. More concretely, we consider two n = 1 option (i.e., primitive action) ε-greedy policies. As in the prior subsection, the behavior policy has ε = 0.3 and the evaluation policy has ε = 0.05. We compare estimating the value of the evaluation policy with normal PDIS to estimating it with partitioned PDIS using Theorem 4. See Figure 3 for results.

Figure 3: Comparing MSE of normal PDIS and PDIS that uses Theorem 4. We gain an order of magnitude reduction in MSE (labeled Partitioned-PDIS). 
Note this did not require that the primitive policy used options: we merely used the fact that if there are subgoals in the domain that the agent is likely to go through with a fixed state distribution, we can leverage that to decompose the value of a long horizon into the sum over multiple shorter ones. Options are one common way this will occur, but as we see in this example, it can also occur in other ways.

8 Covariance Testing

The special case of stationary options can be viewed as a form of dropping certain importance sampling weights from the importance sampling estimator. With stationary options, the weights before the stationary options are dropped when estimating the rewards thereafter. By considering the bias incurred when dropping weights, we derive a general rule involving covariances as follows. Let W1W2r be the ordinary importance sampling estimator for reward r, where the product of the importance sampling weights is partitioned into two products W1 and W2 using some general partitioning scheme such that E(W1) = 1. Note that this condition is satisfied when W1, W2 are chosen according to commonly used schemes such as fixed timesteps (not necessarily consecutive) or fixed states, but can be satisfied by more general schemes as well. Then we can consider dropping the product of weights W1 and simply outputting the estimate W2r:

E(W1W2r) = E(W1)E(W2r) + Cov(W1, W2r)   (6)
         = E(W2r) + Cov(W1, W2r).   (7)

This means that if Cov(W1, W2r) = 0, then we can drop the weights W1 with no bias. Otherwise, the bias incurred is Cov(W1, W2r). We are then free to choose W1, W2 to balance the reduction in variance against the increase in bias.

8.1 Incremental Importance Sampling (INCRIS)

Using the Covariance Test (Equation 7) idea, we propose a new importance sampling algorithm called Incremental Importance Sampling (INCRIS). 
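As an aside that is not part of the paper's experiments, the effect of the covariance test can be checked on synthetic weights: with E(W1) = 1 and W1 independent of W2 r, so that Cov(W1, W2 r) = 0, dropping W1 leaves the mean unchanged while removing W1's contribution to the variance. All quantities below are synthetic stand-ins:

```python
import numpy as np

# Synthetic check of equations (6)-(7): W1, W2, r are stand-in random
# variables with E[W1] = 1 and W1 independent of W2 * r, so the estimator
# that drops W1 is unbiased and has lower variance.
rng = np.random.default_rng(1)
n = 200_000
W1 = rng.choice([0.5, 1.5], size=n)
W2 = rng.choice([0.8, 1.2], size=n)
r = rng.normal(loc=1.0, size=n)

full = W1 * W2 * r    # keep every weight
dropped = W2 * r      # drop W1; the bias is Cov(W1, W2*r), which is ~0 here
```

When W1 is instead correlated with W2 r, the gap between the two sample means estimates the bias term Cov(W1, W2 r) of equation (7).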
This is a variant of PDIS where, for a reward rt, we try to drop all but the most recent k importance sampling weights, using the covariance test to optimize k in order to lower MSE.

Let πb and πe be the behavior and evaluation policies respectively (they may or may not be options-based policies). Let D = {τ^(1), τ^(2), . . . , τ^(n)} be our historical data set generated from πb, with n trajectories of length H. Let ρt = πe(at|st)/πb(at|st), and let ρ_t^(i) be the same but computed from the i-th trajectory. Suppose we are given estimators for covariance and variance. See Algorithm 1 for details.

Algorithm 1 INCRIS
1: Input: D
2: for t = 1 to H do
3:   for k = 0 to t do
4:     A_k = Π_{j=1}^{t−k} ρ_j
5:     B_k = Π_{j=t−k+1}^{t} ρ_j
6:     Estimate Cov(A_k, B_k r_t) and denote it Ĉ_k
7:     Estimate Var(B_k r_t) and denote it V̂_k
8:     Estimate MSE with M̂SE_k = Ĉ_k² + V̂_k
9:   end for
10:  k′ = argmin_k M̂SE_k
11:  Let r̂_t = (1/n) Σ_{i=1}^n B_{k′}^(i) r_t^(i)
12: end for
13: Return Σ_{t=1}^H r̂_t

8.2 Strong Consistency

In the appendix, we provide a proof that INCRIS is strongly consistent. We now give a brief intuition for the proof. As n goes to infinity, the estimates of the MSE get better and better and converge to the bias. We know that if we do not drop any weights, we get an unbiased estimate, and thus the smallest MSE estimate will go to zero. Thus we become more and more likely to pick values of k that correspond to unbiased estimates.

9 Experiment 3: Incremental Importance Sampling

To evaluate INCRIS, we constructed a simple MDP that exemplifies the properties of domains for which we expect INCRIS to be useful. Specifically, we were motivated by the applications of reinforcement learning methods to type 1 diabetes treatments [1, 17] and digital marketing applications [15]. 
In these applications there is a natural place where one might divide data into episodes: for type 1 diabetes treatment, one might treat each day as an independent episode, and for digital marketing, one might treat each user interaction as an independent episode.

However, each day is not actually independent in diabetes treatment: a person's blood sugar in the morning depends on their blood sugar at the end of the previous day. Similarly, in digital marketing applications, whether or not a person clicks on an ad might depend on which ads they were shown previously (e.g., someone might be less likely to click an ad that they were shown before and did not click on then). So, although this division into episodes is reasonable, it does not result in episodes that are completely independent, and so importance sampling will not produce consistent estimates (or estimates that can be trusted for high-confidence off-policy policy evaluation [18]). To remedy this, we might treat all of the data from a single individual (many days, and many page visits) as a single episode, which contains nearly-independent subsequences of decisions.

To model this property, we constructed an MDP with three states, s1, s2, and s3, and two actions, a1 and a2. The agent always begins in s1, where taking action a1 causes a transition to s2 with a reward of +1 and taking action a2 causes a transition to s3 with a reward of −1. In s2, both actions lead to a terminal absorbing state with reward −2 + ε, and in s3 both actions lead to a terminal absorbing state with reward +2. For now, let ε = 0. This simple MDP has a horizon of 2 time steps: after two actions the agent is always in a terminal absorbing state. To model the aforementioned examples, we modified this simple MDP so that whenever the agent would transition to the terminal absorbing state, it instead transitions back to s1. 
After visiting s1 fifty times, the agent finally transitions to a terminal absorbing state. Furthermore, to model the property that the fifty sub-episodes within the larger episode are not completely independent, we set ε = 0 initially, and ε = ε + 0.01 whenever the agent enters s2. This creates a slight dependence across the sub-episodes.

For this illustrative domain, we would like an importance sampling estimator that assumes that sub-episodes are independent when there is little data, in order to reduce variance. However, once there is enough data for the variances of estimates to be sufficiently small relative to the bias introduced by assuming that sub-episodes are independent, the importance sampling estimator should automatically begin considering longer sequences of actions, as INCRIS does. We compared INCRIS to ordinary importance sampling (IS), per-decision importance sampling (PDIS), weighted importance sampling (WIS), and consistent weighted per-decision importance sampling (CWPDIS). The behavior policy selects actions randomly, while the evaluation policy selects action a1 with a higher probability than a2. In Figure 4 we report the mean squared errors of the different estimators using different amounts of data.

Figure 4: Performance of different estimators on the simple MDP that models properties of the diabetes treatment and digital marketing applications. The reported mean squared errors are the sample mean squared errors from 128 trials. Notice that INCRIS achieves an order of magnitude lower mean squared error than all of the other estimators, and for some n it achieves two orders of magnitude improvement over the unweighted importance sampling estimators.

10 Conclusion

We have shown that using options-based behavior and evaluation policies allows for lower mean squared error when using importance sampling due to their structure. 
Furthermore, special cases may naturally arise when using options, such as when options terminate in a fixed state distribution, and these cases lead to a greater reduction of the mean squared error.
We examined options as a first step, but in the future these results may be extended to full hierarchical policies (like the MAXQ framework). We also generalized the naturally occurring special cases to a covariance testing rule that drops weights in order to improve importance sampling predictions. We showed an instance of covariance testing in the algorithm INCRIS, which can greatly improve estimation accuracy for a general class of domains, and we hope to derive more powerful estimators based on covariance testing that apply to even more domains in the future.

Acknowledgements

The research reported here was supported in part by an ONR Young Investigator award, an NSF CAREER award, and by the Institute of Education Sciences, U.S. Department of Education. The opinions expressed are those of the authors and do not represent views of NSF, IES, or the U.S. Department of Education.

References

[1] M. Bastani. Model-free intelligent diabetes management using machine learning. Master's thesis, Department of Computing Science, University of Alberta, 2014.

[2] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In ICML, pages 316–324, 2014.

[3] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research (JAIR), 13:227–303, 2000.

[4] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, pages 1097–1104, 2011.

[5] Assaf Hallak, François Schnitzler, Timothy Arthur Mann, and Shie Mannor.
Off-policy model-based learning under unknown factored dynamics. In ICML, pages 711–719, 2015.

[6] Assaf Hallak, Aviv Tamar, Rémi Munos, and Shie Mannor. Generalized emphatic temporal difference learning: Bias-variance analysis. arXiv preprint arXiv:1509.05172, 2015.

[7] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.

[8] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 608–616, San Diego, California, USA, 09–12 May 2015. PMLR. URL http://proceedings.mlr.press/v38/li15b.html.

[9] Daniel J. Mankowitz, Timothy A. Mann, and Shie Mannor. Time regularized interrupting options. In International Conference on Machine Learning, 2014.

[10] Timothy A. Mann and Shie Mannor. The advantage of planning with options. RLDM 2013, page 9, 2013.

[11] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1046–1054, 2016.

[12] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

[13] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pages 417–424, 2001.

[14] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[15] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh.
Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

[16] P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

[17] P. S. Thomas and E. Brunskill. Importance sampling with unequal support. In AAAI, 2017.

[18] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence, 2015.

[19] Philip S. Thomas, Scott Niekum, Georgios Theocharous, and George Konidaris. Policy evaluation using the Ω-return. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 334–342. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5807-policy-evaluation-using-the-return.pdf.