{"title": "A Computational Decision Theory for Interactive Assistants", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 585, "abstract": "We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant's action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution.", "full_text": "A Computational Decision Theory\n\nfor Interactive Assistants\n\nAlan Fern\n\nSchool of EECS\n\nOregon State University\n\nCorvallis, OR 97331\n\nPrasad Tadepalli\nSchool of EECS\n\nOregon State University\n\nCorvallis, OR 97331\n\nafern@eecs.oregonstate.edu\n\ntadepall@eecs.oregonstate.edu\n\nAbstract\n\nWe study several classes of interactive assistants from the points of view of deci-\nsion theory and computational complexity. 
We \ufb01rst introduce a class of POMDPs\ncalled hidden-goal MDPs (HGMDPs), which formalize the problem of interac-\ntively assisting an agent whose goal is hidden and whose actions are observable.\nIn spite of its restricted nature, we show that optimal action selection in \ufb01nite hori-\nzon HGMDPs is PSPACE-complete even in domains with deterministic dynamics.\nWe then introduce a more restricted model called helper action MDPs (HAMDPs),\nwhere the assistant\u2019s action is accepted by the agent when it is helpful, and can be\neasily ignored by the agent otherwise. We show classes of HAMDPs that are com-\nplete for PSPACE and NP along with a polynomial time class. Furthermore, we\nshow that for general HAMDPs a simple myopic policy achieves a regret, com-\npared to an omniscient assistant, that is bounded by the entropy of the initial goal\ndistribution. A variation of this policy is shown to achieve worst-case regret that\nis logarithmic in the number of goals for any goal distribution.\n\nIntroduction\n\n1\nIntegrating AI with Human Computer Interaction has received signi\ufb01cant attention in recent years [8,\n11, 13, 3, 2]. In most applications, e.g. travel scheduling, information retrieval, or computer desktop\nnavigation, the relevant state of the computer is fully observable, but the goal of the user is not, which\nposes a dif\ufb01cult problem to the computer assistant. The assistant needs to correctly reason about the\nrelative merits of taking different actions in the presence of signi\ufb01cant uncertainty about the goals of\nthe human agent. It might consider taking actions that directly reveal the goal of the agent, e.g. by\nasking questions to the user. However, direct communication is often dif\ufb01cult due to the language\nmismatch between the human and the computer. Another strategy is to take actions that help achieve\nthe most likely goals. Yet another strategy is to take actions that help with a large number of possible\ngoals. 
In this paper, we formulate and study several classes of interactive assistant problems from the points of view of decision theory and computational complexity. Building on the framework of decision-theoretic assistance (DTA) [5], we analyze the inherent computational complexity of optimal assistance in a variety of settings and the sources of that complexity. On the positive side, we analyze a simple myopic heuristic and show that it performs nearly optimally in a reasonably pervasive class of assistance problems, thus explaining some of the positive empirical results of [5].
We formulate the problem of optimal assistance as solving a hidden-goal MDP (HGMDP), which is a special case of a POMDP [6]. In an HGMDP, a (human) agent and a (computer) assistant take actions in turns. The agent's goal is the only unobservable part of the state of the system and does not change throughout the episode. The objective for the assistant is to find a history-dependent policy that maximizes the expected reward of the agent given the agent's goal-based policy and its goal distribution. Despite the restricted nature of HGMDPs, the complexity of determining whether an HGMDP has a finite-horizon policy of a given value is PSPACE-complete even in deterministic environments. This motivates a more restricted model called the Helper Action MDP (HAMDP), where the assistant executes a helper action at each step. The agent is obliged to accept the helper action if it is helpful for its goal, and receives a reward bonus (or cost reduction) for doing so. Otherwise, the agent can continue with its own preferred action without any reward or penalty to the assistant. We show classes of this problem that are complete for PSPACE and NP. We also show that for the class of HAMDPs with deterministic agents there are polynomial time algorithms for minimizing the expected and worst-case regret relative to an omniscient assistant. 
Further, we show that the\noptimal worst case regret can be characterized by a graph-theoretic property called the tree rank of\nthe corresponding all-goals policy tree and can be computed in linear time.\nThe main positive result of the paper is to give a simple myopic policy for general stochastic\nHAMDPs that has a regret which is upper bounded by the entropy of the goal distribution. Fur-\nthermore we give a variant of this policy that is able to achieve worst-case and expected regret that\nis logarithmic in the number of goals without any prior knowledge of the goal distribution.\nTo the best of our knowledge, this is the \ufb01rst formal study of the computational hardness of the prob-\nlem of decision-theoretically optimal assistance and the performance of myopic heuristics. While\nthe current HAMDP results are con\ufb01ned to unobtrusively assisting a competent agent, they provide\na strong foundation for analyzing more complex classes of assistant problems, possibly including\ndirect communication, coordination, partial observability, and irrationality of users.\n\n2 Hidden Goal MDPs\n\nThroughout the paper we will refer to the entity that we are attempting to assist as the agent and\nthe assisting entity as the assistant. Our objective is to select actions for the assistant in order to\nhelp the agent maximize its reward. The key complication is that the agent\u2019s goal is not directly\nobservable to the assistant, so reasoning about the likelihood of possible goals and how to help\nmaximize reward given those goals is required. In order to support this type of reasoning we will\nmodel the agent-assistant process via hidden goal MDPs (HGMDPs).\nGeneral Model. 
An HGMDP describes the dynamics and reward structure of the environment via a first-order Markov model, where it is assumed that the state is fully observable to both the agent and assistant. In addition, an HGMDP describes the possible goals of the agent and the behavior of the agent when pursuing those goals. More formally, an HGMDP is a tuple ⟨S, G, A, A′, T, R, π, IS, IG⟩ where S is a set of states, G is a finite set of possible agent goals, A is the set of agent actions, A′ is the set of assistant actions, T is the transition function such that T(s, g, a, s′) is the probability of a transition to state s′ from s after taking action a ∈ A ∪ A′ when the agent goal is g, R is the reward function which maps S × G × (A ∪ A′) to real-valued rewards, π is the agent's policy that maps S × G to distributions over A and need not be optimal in any sense, and IS (IG) is an initial state (goal) distribution. The dependence of the reward and policy on the goal allows the model to capture the agent's desires and behavior under each goal. The dependence of T on the goal is less intuitive, and in many cases there will be no dependence when T is used only to model the dynamics of the environment. However, we allow goal dependence of T for generality of modeling. For example, it can be convenient to model basic communication actions of the agent as changing aspects of the state, and the result of such actions will often be goal dependent.
We consider a finite-horizon episodic problem setting where the agent begins each episode in a state drawn from IS with a goal drawn from IG. The goal, for example, might correspond to a physical location, a dish that the agent wants to cook, or a destination folder on a computer desktop. 
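To make the tuple concrete, the HGMDP components can be written down as a minimal container. This is only an illustrative sketch: the field names and the tabular dict-based encoding are ours, not part of the formal model.

```python
from dataclasses import dataclass

# A minimal, tabular sketch of the HGMDP tuple <S, G, A, A', T, R, pi, IS, IG>.
# The dict encodings (keyed by tuples) are illustrative choices, not from the model.
@dataclass
class HGMDP:
    states: set             # S
    goals: set              # G (finite)
    agent_actions: set      # A
    assistant_actions: set  # A'
    T: dict    # (s, g, a) -> {s': prob}, for a in A | A'
    R: dict    # (s, g, a) -> real-valued reward
    pi: dict   # (s, g) -> {a: prob}, the agent's (possibly suboptimal) policy
    IS: dict   # s -> prob, initial state distribution
    IG: dict   # g -> prob, initial goal distribution

    def check(self, tol=1e-9):
        """Sanity-check that every stored distribution sums to one."""
        assert abs(sum(self.IS.values()) - 1.0) < tol
        assert abs(sum(self.IG.values()) - 1.0) < tol
        for key, dist in self.T.items():
            assert abs(sum(dist.values()) - 1.0) < tol, key
        for key, dist in self.pi.items():
            assert abs(sum(dist.values()) - 1.0) < tol, key
        return True
```

A two-state instance with a single goal suffices to exercise the container; note that nothing here constrains π to be optimal, matching the definition above.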
The process then alternates between the agent and assistant executing actions (including noops) in the environment until the horizon is reached. The agent is assumed to select actions according to π. In many domains, a terminal goal state will be reached within the horizon, though in general, goals can have arbitrary impact on the reward function. The reward for the episode is equal to the sum of the rewards of the actions executed by the agent and assistant during the episode. The objective of the assistant is to reason about the HGMDP and observed state-action history in order to select actions that maximize the expected (or worst-case) total reward of an episode.
An example HGMDP from previous work [5] is the doorman domain, where an agent navigates a grid world in order to arrive at certain goal locations. To move from one location to another the agent must open a door and then walk through the door. The assistant can reduce the effort for the agent by opening the relevant doors for the agent. Another example from [1] involves a computer desktop where the agent wishes to navigate to certain folders using a mouse. The assistant can select actions that offer the agent a small number of shortcuts through the folder structure.
Given knowledge of the agent's goal g in an HGMDP, the assistant's problem reduces to solving an MDP over assistant actions. The MDP transition function captures both the state change due to the assistant action and also the ensuing state change due to the agent action selected according to the policy π given g. Likewise the reward function on a transition captures the reward due to the assistant action and the ensuing agent action conditioned on g. The optimal policy for this MDP corresponds to an optimal assistant policy for g. 
However, since the real assistant will often have\nuncertainty about the agent\u2019s goal, it is unlikely that this optimal performance will be achieved.\nComputational Complexity. We can view an HGMDP as a collection of |G| MDPs that share the\nsame state space, where the assistant is placed in one of the MDPs at the beginning of each episode,\nbut cannot observe which one. Each MDP is the result of \ufb01xing the goal component of the HG-\nMDP de\ufb01nition to one of the goals. This collection can be easily modeled as a restricted type of\npartially observable MDP (POMDP) with a state space S \u00d7 G. The S component is completely ob-\nservable, while the G component is unobservable but only changes at the beginning of each episode\n(according to IG) and remains constant throughout an episode. Furthermore, each POMDP tran-\nsition provides observations of the agent action, which gives direct evidence about the unchanging\nG component. From this perspective HGMDPs appear to be a signi\ufb01cant restriction over general\nPOMDPs. However, our \ufb01rst result shows that despite this restriction the worst-case complexity is\nnot reduced even for deterministic dynamics.\nGiven an HGMDP M, a horizon m = O(|M|) where |M| is the size of the encoding of M, and a\nreward target r\u2217, the short-term reward maximization problem asks whether there exists a history-\ndependent assistant policy that achieves an expected \ufb01nite horizon reward of at least r\u2217. For general\nPOMDPs this problem is PSPACE-complete [12, 10], and for POMDPs with deterministic dynam-\nics, it is NP-complete [9]. However, we have the following result.\nTheorem 1. Short-term reward maximization for HGMDPs with deterministic dynamics is\nPSPACE-complete.\n\nThe proof is in the appendix. This result shows that any POMDP can be encoded as an HGMDP\nwith deterministic dynamics, where the stochastic dynamics of the POMDP are captured via the\nstochastic agent policy in the HGMDP. 
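Because the hidden goal is static within an episode and the agent's actions are observed, the assistant's belief over goals can be maintained by a one-line Bayes filter. The sketch below assumes access to the agent policy π as a conditional distribution; the function names are ours.

```python
def update_goal_belief(belief, pi, s, a):
    """One Bayes-filter step over goals: after observing the agent take action a
    in state s, reweight each goal g by the likelihood pi(a | s, g) and renormalize.
    belief: {g: prob}; pi: callable (s, g) -> {a: prob}."""
    posterior = {g: p * pi(s, g).get(a, 0.0) for g, p in belief.items()}
    z = sum(posterior.values())
    if z == 0.0:
        # the observed action is impossible under every goal with nonzero mass
        raise ValueError("observed action inconsistent with belief support")
    return {g: p / z for g, p in posterior.items()}
```

For instance, with a uniform prior over two goals where the observed action is certain under the first goal but only a coin flip under the second, one update moves the belief to 2/3 versus 1/3. Goals whose likelihood falls to zero are exactly those dropped from the consistent goal set used later in the paper.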
However, the HGMDPs resulting from the PSPACE-hardness\nreduction are quite pathological compared to those that are likely to arise in practice. Most impor-\ntantly, the agent\u2019s actions provide practically no information about the agent\u2019s goal until the end of\nan episode, when it is too late to exploit this knowledge. This suggests that we search for restricted\nclasses of HGMDPs that will allow for ef\ufb01cient solutions with performance guarantees.\n\n3 Helper Action MDPs\nThe motivation for HAMDPs is to place restrictions on the agent and assistant that avoid the fol-\nlowing three complexities that arise in general HGMDPs: 1) the agent can behave arbitrarily poorly\nif left unassisted and as such the agent actions may not provide signi\ufb01cant evidence about the goal;\n2) the agent is free to effectively \u201cignore\u201d the assistant\u2019s help and not exploit the results of assistive\naction, even when doing so would be bene\ufb01cial; and 3) the assistant actions have the possibility of\nnegatively impacting the agent compared to not having an assistant. HAMDPs will address the \ufb01rst\nissue by assuming that the agent is competent at (approximately) maximizing reward without the\nassistant. The last two issues will be addressed by assuming that the agent will always \u201cdetect and\nexploit\u201d helpful actions and that the assistant actions do not hurt the agent.\nInformally, the HAMDP provides the assistant with a helper action for each of the agent\u2019s actions.\nWhenever a helper action h is executed directly before the corresponding agent action a, the agent\nreceives a bonus reward of 1. However, the agent will only accept the helper action h (by taking a)\nand hence receive the bonus, if a is an action that the agent considers to be good for achieving the\ngoal without the assistant. Thus, the primary objective of the assistant in an HAMDP is to maximize\nthe number of helper actions that get accepted by the agent. 
While simple, this model captures much of the essence of assistance domains where assistant actions cause minimal harm and the agent is able to detect and accept good assistance when it arises.
An HAMDP is an HGMDP ⟨S, G, A, A′, T, R, π, IS, IG⟩ with the following constraints:

• The agent and assistant action sets are A = {a1, . . . , an} and A′ = {h1, . . . , hn}, so that for each ai there is a corresponding helper action hi.
• The state space is S = W ∪ (W × A′), where W is a set of world states. States in W × A′ encode the current world state and the previous assistant action.
• The reward function R is 0 for all assistant actions. For agent actions the reward is zero unless the agent selects the action ai in state (s, hi), which gives a reward of 1. That is, the agent receives a bonus of 1 whenever its action corresponds to the preceding helper action.
• The assistant always acts from states in W, and T is such that taking hi in s deterministically transitions to (s, hi).
• The agent always acts from states in W × A′, resulting in states in W according to a transition function that does not depend on hi, i.e. T((s, hi), g, ai, s′) = T′(s, g, ai, s′) for some transition function T′.
• Finally, for the agent policy, let Π(s, g) be a function that returns a set of actions and P(s, g) be a distribution over those actions. We will view Π(s, g) as the set of actions that the agent considers acceptable (or equally good) in state s for goal g. The agent policy π always selects ai after the helper action hi whenever ai is acceptable. That is, π((s, hi), g) = ai whenever ai ∈ Π(s, g). Otherwise the agent draws an action according to P(s, g).

In an HAMDP, the primary impact of an assistant action is to influence the reward of the following agent action. 
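One assistant/agent round under these constraints can be simulated directly. This is a sketch with hypothetical helper functions: `Pi(s, g)` is the acceptable-action set, `P(s, g)` the agent's fallback distribution, and `T` the goal-dependent transition; we identify each helper action hi with the name of its agent action ai.

```python
import random

def hamdp_step(s, g, h, Pi, P, T, rng=random):
    """Simulate one assistant/agent round of an HAMDP from world state s.
    The assistant proposes helper action h (deterministically reaching (s, h));
    the agent takes h and earns the bonus of 1 iff h is acceptable for goal g,
    and otherwise falls back to its own distribution P(s, g) with no bonus.
    Returns (agent_action, bonus, next_world_state)."""
    if h in Pi(s, g):                  # agent accepts the helper action
        a, bonus = h, 1
    else:                              # helper action ignored at no extra cost
        actions, probs = zip(*P(s, g).items())
        a, bonus = rng.choices(actions, probs)[0], 0
    s_next = T(s, g, a, rng)           # environment transition T'(s, g, a, .)
    return a, bonus, s_next
```

In a doorman-style instance, proposing the door the agent actually needs yields the bonus, while proposing the wrong door costs the assistant only the missed bonus, which is exactly the misprediction regret analyzed in the next section.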
The only rewards in HAMDPs are the bonuses received whenever the agent accepts a helper action. Any additional environmental reward is assumed to be already captured by the agent policy via Π(s, g), which contains actions that approximately optimize this reward.
The HAMDP model can be adapted to both the doorman domain in [5] and the folder prediction domain from [1]. In the doorman domain, the helper actions correspond to opening doors for the agent, which reduce the cost of navigating from one room to another. Importantly, opening an incorrect door has a fixed reward loss compared to an optimal assistant, which is a key property of HAMDPs. In the folder prediction domain, the system proposes multiple folders to save a file, potentially saving the user a few clicks every time the proposal is accepted.
Despite the apparent simplification of HAMDPs over HGMDPs, somewhat surprisingly the worst case computational complexity is not reduced.
Theorem 2. Short-term reward maximization for HAMDPs is PSPACE-complete.

The proof is in the appendix. Unlike the case of HGMDPs, we will see that the stochastic dynamics are essential for PSPACE-hardness. Despite this negative result, the following sections show the utility of the HAMDP restriction by giving performance guarantees for simple policies and improved complexity results in special cases. So far, there are no analogous results for HGMDPs.

4 Regret Analysis for HAMDPs
Given an assistant policy π′, the regret of a particular episode is the extra reward that an omniscient assistant with knowledge of the goal would achieve over π′. For HAMDPs the omniscient assistant can always achieve a reward equal to the finite horizon m, because it can always select a helper action that will be accepted by the agent. 
Thus, the regret of an execution of π′ in an HAMDP is equal to the number of helper actions that are not accepted by the agent, which we will call mispredictions. From above we know that optimizing regret is PSPACE-hard, and thus here we focus on bounding the expected and worst-case regret of the assistant. We now show that a simple myopic policy is able to achieve regret bounds that are logarithmic in the number of goals.
Myopic Policy. Intuitively, our myopic assistant policy π̂ will select an action that has the highest probability of being accepted with respect to a “coarsened” version of the posterior distribution over goals. The myopic policy in state s given history H is based on the consistent goal set C(H), which is the set of goals that have non-zero probability with respect to history H. It is straightforward to maintain C(H) after each observation. The myopic policy is defined as:

π̂(s, H) = arg max_a IG(C(H) ∩ G(s, a))

where G(s, a) = {g | a ∈ Π(s, g)} is the set of goals for which the agent considers a to be an acceptable action in state s. The expression IG(C(H) ∩ G(s, a)) can be viewed as the probability mass of G(s, a) under a coarsened goal posterior which assigns goals outside of C(H) probability zero and otherwise weighs them proportional to the prior.
Theorem 3. For any HAMDP the expected regret of the myopic policy is bounded above by the entropy of the goal distribution H(IG).

Proof. 
The main idea of the proof is to show that after each misprediction of the myopic policy (i.e. the selected helper action is not accepted by the agent) the uncertainty about the goal is reduced by a constant factor, which will allow us to bound the total number of mispredictions on any trajectory. Consider a misprediction step where the myopic policy selects helper action hi in state s given history H, but the agent does not accept the action and instead selects a* ≠ ai. By the definition of the myopic policy we know that IG(C(H) ∩ G(s, ai)) ≥ IG(C(H) ∩ G(s, a*)), since otherwise the assistant would not have chosen hi. From this fact we now argue that IG(C(H′)) ≤ IG(C(H))/2, where H′ is the history after the misprediction. That is, the probability mass under IG of the consistent goal set after the misprediction is at most half that of the consistent goal set before the misprediction. To show this we will consider two cases: 1) IG(C(H) ∩ G(s, ai)) < IG(C(H))/2, and 2) IG(C(H) ∩ G(s, ai)) ≥ IG(C(H))/2. In the first case, we immediately get that IG(C(H) ∩ G(s, a*)) < IG(C(H))/2. Combining this with the fact that C(H′) ⊆ C(H) ∩ G(s, a*) we get the desired result that IG(C(H′)) ≤ IG(C(H))/2. In the second case, note that

C(H′) ⊆ C(H) ∩ (G(s, a*) − G(s, ai)) ⊆ C(H) − (C(H) ∩ G(s, ai))

Combining this with our assumption for the second case implies that IG(C(H′)) ≤ IG(C(H))/2. This implies that for any episode, after n mispredictions resulting in a history Hn, IG(C(Hn)) ≤ 2^−n. Now consider an arbitrary episode where the true goal is g. 
We know that IG(g) is a lower bound on IG(C(Hn)), which implies that IG(g) ≤ 2^−n, or equivalently that n ≤ −log(IG(g)). Thus for any episode with goal g the maximum number of mistakes is bounded by −log(IG(g)). Using this fact we get that the expected number of mispredictions during an episode with respect to IG is bounded above by −Σ_g IG(g) log(IG(g)) = H(IG), which completes the proof.

Since H(IG) ≤ log(|G|), this result implies that for HAMDPs the expected regret of the myopic policy is no more than logarithmic in the number of goals. Furthermore, as the uncertainty about the goal decreases (decreasing H(IG)) the regret bound improves, until we get a regret of 0 when IG puts all mass on a single goal. This logarithmic bound is asymptotically tight in the worst case.
Theorem 4. There exists an HAMDP such that for any assistant policy the expected regret is at least log(|G|)/2.

Proof. Consider a deterministic HAMDP such that the environment is structured as a binary tree of depth log(|G|), where each leaf corresponds to one of the |G| goals. By considering a uniform goal distribution it is easy to verify that at any node in the tree there is an equal chance that the true goal is in the left or right sub-tree during any episode. Thus, any policy will have a 0.5 chance of committing a misprediction at each step of an episode. Since each episode is of length log(|G|), the expected regret of an episode for any policy is log(|G|)/2.

Resolving the gap between the myopic policy bound and this regret lower bound is an open problem.
Approximate Goal Distributions. Suppose that the assistant uses an approximate goal distribution I′G instead of the true underlying goal distribution IG when computing the myopic policy. 
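The myopic rule can be sketched for an arbitrary prior, so the same code covers both the true distribution IG and an approximation. The representation (`consistent_goals` as a set, `G` as a callable) is our own.

```python
def myopic_action(s, consistent_goals, prior, actions, G):
    """Myopic HAMDP assistant: pick the action a maximizing the prior mass of
    the consistent goals that accept a, i.e. argmax_a prior(C(H) & G(s, a)).
    prior: {g: prob}; G: callable (s, a) -> set of goals for which a is acceptable."""
    def mass(a):
        return sum(prior[g] for g in consistent_goals & G(s, a))
    return max(actions, key=mass)
```

Note that the prior is never renormalized over C(H): the argmax is unaffected by the normalizing constant, which is exactly why the coarsened posterior in the definition above can simply weigh consistent goals proportional to the prior.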
That is, the assistant selects actions that maximize I′G(C(H) ∩ G(s, a)), which we will refer to as the myopic policy relative to I′G. The extra regret for using I′G instead of IG can be bounded in terms of the KL-divergence between these distributions, KL(IG ‖ I′G), which is zero when I′G equals IG.
Theorem 5. For any HAMDP with goal distribution IG, the expected regret of the myopic policy with respect to distribution I′G is bounded above by H(IG) + KL(IG ‖ I′G).

The proof is in the appendix. Deriving similar results for other approximations is an open problem. A consequence of Theorem 5 is that the myopic policy with respect to the uniform goal distribution has expected regret bounded by log(|G|) for any HAMDP, showing that logarithmic regret can be achieved without knowledge of IG. This can be strengthened to hold for worst case regret.
Theorem 6. For any HAMDP, the worst case and hence expected regret of the myopic policy with respect to the uniform goal distribution is bounded above by log(|G|).

Proof. The proof of Theorem 5 shows that the number of mispredictions on any episode is bounded above by −log(I′G(g)). In our case I′G(g) = 1/|G|, which shows a worst case regret bound of log(|G|), which also bounds the expected regret of the uniform myopic policy.

5 Deterministic and Bounded Choice Policies
We now consider several special cases of HAMDPs. First, we restrict the agent's policy to be deterministic for each goal, i.e. Π(s, g) has at most a single action for each state-goal pair (s, g).
Theorem 7. The myopic policy achieves the optimal expected reward for HAMDPs with deterministic agent policies.

The proof is given in the appendix. 
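Theorems 4 and 6 are easy to sanity-check empirically on the binary-tree construction from the proof of Theorem 4. The simulation below is a hypothetical illustration, not from the paper: goals are bit strings naming the leaves, the agent's acceptable action at each node is the next bit of its goal, and the assistant is myopic with a uniform prior.

```python
def run_episode(goal_bits, goals, depth):
    """Myopic assistant (uniform prior) on the depth-d binary-tree HAMDP from
    the Theorem 4 construction: at step i the assistant predicts the bit backed
    by more consistent goals; a wrong prediction is one misprediction."""
    consistent = set(goals)
    mispredictions = 0
    for i in range(depth):
        # goals consistent with taking bit b at step i
        branch = {b: {g for g in consistent if g[i] == b} for b in "01"}
        guess = max("01", key=lambda b: len(branch[b]))
        if guess != goal_bits[i]:
            mispredictions += 1
        consistent = branch[goal_bits[i]]
    return mispredictions
```

With depth 4 and all 16 goals, the worst episode incurs exactly 4 mispredictions, matching the log |G| worst-case bound of Theorem 6, while the average over goals is 2 = log(|G|)/2, matching the Theorem 4 lower bound.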
We now consider the case where both the agent policy and the environment are deterministic, and attempt to minimize the worst possible regret compared to an omniscient assistant who knows the agent's goal. As it happens, this “minimax policy” can be captured by a graph-theoretic notion of tree rank that generalizes the rank of decision trees [4].
Definition 1. The rank of a rooted tree is the rank of its root node. If a node is a leaf node then rank(node) = 0; else if a node has at least two distinct children c1 and c2 with equal highest ranks among all children, then rank(node) = 1 + rank(c1); otherwise rank(node) = the rank of the highest ranked child.

The optimal trajectory tree (OTT) of an HAMDP in deterministic environments is a tree whose nodes represent the states of the HAMDP reached by the prefixes of optimal action sequences for different goals starting from the initial state.¹ Each node in the tree represents a state and a set of goals for which it is on the optimal path from the initial state.
Since the agent policy and the environment are both deterministic, there is at most one trajectory per goal in the tree. Hence the size of the optimal trajectory tree is bounded by the number of goals times the maximum length of any trajectory, which is at most the size of the state space in deterministic domains. The following lemma follows by induction on the depth of the optimal trajectory tree.
Lemma 1. The minimum worst-case regret of any policy for an HAMDP for deterministic environments and deterministic agent policies is equal to the tree rank of its optimal trajectory tree.

Theorem 8. If the agent policy is deterministic, the problem of minimizing the maximum regret in HAMDPs in deterministic environments is in P.

Proof. We first construct the optimal trajectory tree. 
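The recursive definition of tree rank (Definition 1) translates directly into a linear-time computation; a minimal sketch, in which the nested-list encoding of trees is our own:

```python
def tree_rank(children):
    """Rank of a rooted tree per Definition 1. A node is the (possibly empty)
    list of its child subtrees; leaves have rank 0; a node with at least two
    children tied at the maximal child rank gets that rank plus one."""
    if not children:
        return 0
    ranks = sorted(tree_rank(c) for c in children)
    top = ranks[-1]
    if len(ranks) >= 2 and ranks[-2] == top:
        return top + 1
    return top
```

For example, a complete binary tree of depth d has rank d (every internal node has two children of equal rank), while a lopsided tree whose maximal-rank child is unique keeps that child's rank, which is why unbalanced optimal trajectory trees can admit low worst-case regret.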
We then compute its rank and the optimal minimax policy using the recursive definition of tree rank in linear time.

The assumption of a deterministic agent policy may be too restrictive in many domains. We now consider HAMDPs in which the agent policies have a constant bound on the number of possible actions in Π(s, g) for each state-goal pair. We call them bounded choice HAMDPs.
Definition 2. The branching factor of an HAMDP is the largest number of possible actions in Π(s, g) by the agent in any state for any goal and any assistant's action.

The doorman domain of [5] has a branching factor of 2 since there are at most two optimal actions to reach any goal from any state.
Theorem 9. Minimizing the worst-case regret in finite horizon bounded choice HAMDPs of a constant branching factor k ≥ 2 in deterministic environments is NP-complete.

The proof is in the appendix. We can also show that minimizing the expected regret for a bounded k is NP-hard. We conjecture that this problem is also in NP, but this question remains open.

¹If there are multiple initial states, we build an OTT for each initial state. Then the rank would be the maximum of the ranks of all trees.

6 Conclusions and Future Work

In this paper, we formulated the problem of optimal assistance and analyzed its complexity in multiple settings. We showed that the general problem of HGMDP is PSPACE-complete due to the lack of constraints on the user, who can behave stochastically or even adversarially with respect to the assistant, which makes the assistant's task very difficult. By suitably constraining the user's actions through HAMDPs, we are able to reduce the complexity to NP-complete, but only in deterministic environments with bounded choice agents. 
More encouragingly, we are able to show that HAMDPs\nare amenable to a simple myopic heuristic which has a regret bounded by the entropy of the goal\ndistribution when compared to the omniscient assistant. This is a satisfying result since optimal\ncommunication of the goal requires as much information to pass from the agent to the assistant. Im-\nportantly, this result applies to stochastic as well as deterministic environments and with no bound\non the number of agent\u2019s action choices.\nAlthough HAMDPs are somewhat restricted compared to possible assistantship scenarios one could\nimagine, they in fact \ufb01t naturally to many domains where the user is on-line, knows which helper\nactions are acceptable, and accepts help when it is appropriate to the goal. Indeed, in many domains,\nit is reasonable to constrain the assistant so that the agent has the \ufb01nal say on approving the actions\nproposed by the assistant. These scenarios range from the ubiquitous auto-complete functions and\nMicrosoft\u2019s infamous Paperclip to more sophisticated adaptive programs such as SmartEdit [7]\nand TaskTracer [3] that learn assistant policies from users\u2019 long-term behaviors. By analyzing\nthe complexity of these tasks in a more general framework than what is usually done, we shed\nlight on some of the sources of complexity such as the stochasticity of the environment and the\nagent\u2019s policy. Many open problems remain including generalization of these and other results to\nmore general assistant frameworks, including partially observable and adversarial settings, learning\nassistants, and multi-agent assistance.\n\n7 Appendix\n\nProof of Theorem 1. Membership in PSPACE follows from the fact that any HGMDP can be poly-\nnomially encoded as a POMDP for which policy existence is in PSPACE. 
To show PSPACE-\nhardness, we reduce the QSAT problem to the problem of the existence of a history-dependent\nassistant policy of expected reward \u2265 r.\nLet \u03c6 be a quanti\ufb01ed Boolean formula \u2200x1\u2203x2\u2200x3 . . .\u2203xn {C1(x1, . . . , xn) \u2227 . . . \u2227\nCm(x1, . . . , xn)}, where each Ci is a disjunctive clause. For us, each goal gi is a quanti\ufb01ed clause,\n\u2200x1\u2203x2\u2200x3 . . .\u2203xn {Ci(x1, . . . , xn)}. The agent chooses a goal uniformly randomly from the set\nof goals formed from \u03c6 and hides it from the assistant. The states consist of pairs of the form (v, i),\nwhere v \u2208 {0, 1} is the current value of the goal clause, and i is the next variable to set. The ac-\ntions of the assistant are to set the existentially quanti\ufb01ed variables. The agent simulates setting the\nuniversally quanti\ufb01ed variables by choosing actions from the set {0, 1} with equal probability. The\nepisode terminates when all the variables are set, and the assistant gets a reward of 1 if the value of\nthe clause is 1 at the end and a reward of 0 otherwise.\nNote that the assistant does not get any useful feedback from the agent until it is too late and it\neither makes a mistake or solves the goal. The best the assistant can do is to \ufb01nd an optimal history-\ndependent policy that maximizes the expected reward over the goals in \u03a6. If \u03a6 is satis\ufb01able, then\nthere is an assistant policy that leads to a reward of 1 over all goals and all agent actions, and hence\nhas an expected value of 1 over the goal distribution. If not, then at least one of the goals will not be\nsatis\ufb01ed for some setting of the universal quanti\ufb01ers, leading to an expected value < 1.\n\nProof of Theorem 2. Membership in PSPACE follows easily since HAMDP is a specialization of\nHGMDP. 
The proof of PSPACE-hardness is identical to that of Theorem 1, except that here the stochastic environment, rather than the agent's actions, models the universal quantifiers. The agent accepts all actions until the last one, setting each variable as suggested by the assistant. After each of the assistant's actions, the environment chooses a value for the universally quantified variable with equal probability. The last action is accepted by the agent if the goal clause evaluates to 1, and rejected otherwise. There is a history-dependent policy whose expected reward is at least the number of existential variables if and only if the quantified Boolean formula is satisfiable.

Proof of Theorem 5. The proof is similar to that of Theorem 3, except that since the myopic policy is with respect to $I'_G$ rather than $I_G$, the maximum number of mispredictions $n$ on any episode with goal $g$ is bounded above by $-\log(I'_G(g))$. Hence, the average number of mispredictions is given by:
$$\sum_g I_G(g) \log\frac{1}{I'_G(g)} = \sum_g I_G(g)\left(\log\frac{1}{I'_G(g)} + \log I_G(g) - \log I_G(g)\right)$$
$$= \sum_g I_G(g) \log\frac{I_G(g)}{I'_G(g)} - \sum_g I_G(g) \log I_G(g) = H(I_G) + \mathrm{KL}(I_G \,\|\, I'_G).$$

Proof of Theorem 7. According to the theory of POMDPs, the optimal action in a POMDP maximizes the sum of the immediate expected reward and the value of the resulting belief state (of the assistant) [6]. When the agent's policy is deterministic, the initial goal distribution $I_G$ and the history $H$ of agent actions and states fully determine the assistant's belief state. Let $V(I_G, H)$ denote the optimal value of the current belief state.
It satisfies the following Bellman equation, where $H'$ stands for the history after the assistant's action $h_i$ and the agent's action $a_j$:
$$V(I_G, H) = \max_{h_i} E\big[R((s, h_i), g, a_j) + V(I_G, H')\big].$$
Since there is only one agent action $a^*(s, g)$ in $\Pi(s, g)$, the subsequent state $s'$ in $H'$ and its value do not depend on $h_i$. Hence the best helper action $h^*$ of the assistant is given by
$$h^*(I_G, H) = \arg\max_{h_i} E\big[R((s, h_i), g, a^*(s, g))\big] = \arg\max_{h_i} \sum_{g \in C(H)} I_G(g)\, \mathbb{I}(a_i \in \Pi(s, g)) = \arg\max_{h_i} I_G\big(C(H) \cap G(s, a_i)\big),$$
where $C(H)$ is the set of goals consistent with the current history $H$, $G(s, a_i)$ is the set of goals $g$ for which $a_i \in \Pi(s, g)$, and $\mathbb{I}(a_i \in \Pi(s, g))$ is the indicator function, equal to 1 exactly when $a_i \in \Pi(s, g)$. Note that $h^*$ is exactly the myopic policy.

Proof of Theorem 9. We first show that the problem is in NP. For each initial state, we build a tree representation of an optimal history-dependent policy, which acts as a polynomial-size certificate. Every node in the tree is represented by a pair $(s_i, G_i)$, where $s_i$ is a state and $G_i$ is the set of goals for which the node is on a good path from the root. We let $h_i$ be the helper action selected at node $i$. The children of a node represent the possible successor nodes $(s_j, G_j)$ reached by the agent's response to $h_i$. Note that multiple children can result from the same action, because the dynamics is a function of the agent's goal.

To verify that the optimal policy tree is of polynomial size, we note that the number of leaf nodes is upper bounded by $|G| \times \max_g N(g)$, where $N(g)$ is the number of leaf nodes generated by the goal $g$ and $G$ is the set of all goals.
To estimate $N(g)$, we note that by our protocol, any node $(s_i, G_i)$ with $g \in G_i$ and assistant action $h_i$ has a single successor containing $g$ if $a_i \in \Pi(s, g)$. Otherwise there is a misprediction, which leads to at most $k$ successors for $g$. Hence the number of nodes reached for $g$ grows geometrically with the number of mispredictions. Since there are at most $\log |G|$ mispredictions on any such path, $N(g) \leq k^{\log_2 |G|} = k^{\log_k |G| \log_2 k} = |G|^{\log_2 k}$. Hence the total number of leaf nodes of the tree is bounded by $|G|^{1 + \log_2 k}$, and the total number of nodes in the tree is bounded by $m|G|^{1 + \log_2 k}$, where $m$ is the number of steps to the horizon. Since this is polynomial in the problem parameters, the problem is in NP.

To show NP-hardness, we reduce 3-SAT to the given problem. We treat each 3-literal clause $C_i$ of a propositional formula $\Phi$ as a possible goal. The rest of the proof is identical to that of Theorem 1, except that all variables are set by the assistant. The agent accepts every setting, except possibly the last one, which it reverses if the clause evaluates to 0. Since the assistant gets no useful information until it either makes the clause true or fails to do so, its optimal policy is to choose the assignment that maximizes the number of satisfied clauses, so that mistakes are minimized. The assistant makes a single prediction mistake on the last literal of each clause that is not satisfied by the assignment. Hence, the worst regret on any goal is 0 iff the 3-SAT instance is satisfiable.

Acknowledgments
The authors gratefully acknowledge the support of NSF under grants IIS-0905678 and IIS-0964705.

References
[1] Xinlong Bao, Jonathan L. Herlocker, and Thomas G. Dietterich. Fewer clicks and less frustration: Reducing the cost of reaching the right folder. In IUI, pages 178–185, 2006.
[2] J. Boger, P. Poupart, J. Hoey, C. Boutilier, G. Fernie, and A. Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. In IJCAI, 2005.
[3] Anton N. Dragunov, Thomas G. Dietterich, Kevin Johnsrude, Matt McLaughlin, Lida Li, and Jon L. Herlocker. TaskTracer: A desktop environment to support multi-tasking knowledge workers. In Proceedings of IUI, 2005.
[4] Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231–246, September 1989.
[5] A. Fern, S. Natarajan, K. Judah, and P. Tadepalli. A decision-theoretic model of assistance. In Proceedings of the International Joint Conference on Artificial Intelligence, 2007.
[6] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[7] Tessa A. Lau, Steven A. Wolfman, Pedro Domingos, and Daniel S. Weld. Programming by demonstration using version space algebra. Machine Learning, 53(1-2):111–156, 2003.
[8] H. Lieberman. User interface goals, AI opportunities. AI Magazine, 30(2), 2009.
[9] M. L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, Providence, RI, 1996.
[10] Martin Mundhenk. The Complexity of Planning with Partially-Observable Markov Decision Processes. PhD thesis, Friedrich-Schiller-Universität, 2001.
[11] K. Myers, P. Berry, J. Blythe, K. Conley, M. Gervasio, D. McGuinness, D. Morley, A. Pfeffer, M. Pollack, and M. Tambe. An intelligent personal assistant for task and time management. AI Magazine, 28(2):47–61, 2007.
[12] C. Papadimitriou and J. Tsitsiklis. The complexity of Markov Decision Processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[13] M. Tambe. Electric Elves: What went wrong and why. AI Magazine, 29(2), 2008.