{"title": "Bounded Finite State Controllers", "book": "Advances in Neural Information Processing Systems", "page_first": 823, "page_last": 830, "abstract": "", "full_text": "Bounded Finite State Controllers\n\nPascal Poupart\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ON M5S 3H5\nppoupart@cs.toronto.edu\n\nCraig Boutilier\nDepartment of Computer Science\nUniversity of Toronto\nToronto, ON M5S 3H5\ncebly@cs.toronto.edu\n\nAbstract\n\nWe describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).\n\n1 Introduction\n\nFinite state controllers (FSCs) provide a simple, convenient way of representing policies for partially observable Markov decision processes (POMDPs). Two general approaches are often used to construct good controllers: policy iteration (PI) [7] and gradient ascent (GA) [10, 11, 1]. The former is guaranteed to converge to an optimal policy; however, the size of the controller often grows intractably. In contrast, the latter restricts its search to controllers of a bounded size, but may get trapped in a local optimum.\n\nWhile locally optimal solutions are often acceptable, for many planning problems with a combinatorial flavor, GA can easily get trapped by simple policies that are far from optimal. Consider a system engaged in preference elicitation, charged with discovering an optimal query policy to determine relevant aspects of a user\u2019s utility function. Often no single question yields information of much value, while a sequence of queries does. 
If each question has a cost, a system that locally optimizes the policy by GA may determine that the best course of action is to ask no questions (i.e., minimize cost given no information gain). When an optimal policy consists of a sequence of actions, any small perturbation to which results in a bad policy, there is little hope of finding this sequence using methods that greedily perform local perturbations such as those employed by GA.\n\nIn general, we would like the best of both worlds: bounded controller size and convergence to a global optimum. While achieving both is NP-hard for the class of deterministic controllers [10], one can hope for a tractable algorithm that at least avoids obvious local optima. We propose a new anytime algorithm, bounded policy iteration (BPI), that improves a policy much like Hansen\u2019s PI [7] while keeping the size of the controller fixed. Whenever the algorithm gets stuck in a local optimum, the controller is allowed to grow slightly by introducing one (or a few) node(s) to escape the local optimum.\n\nFollowing a brief review of FSCs (Sec. 2), we extend PI to stochastic controllers (Sec. 3), thus admitting smaller, high quality controllers. We then derive the BPI algorithm by ensuring that the number of nodes remains unchanged (Sec. 4). We analyze the structure of local optima for BPI (Sec. 5), relate this analysis to GA, and use it to justify a new method to escape local optima. Finally, we report some preliminary experiments (Sec. 6).\n\n2 Finite State Controllers for POMDPs\n\nA POMDP is defined by a set of states $S$; a set of actions $A$; a set of observations $O$; a transition function $T$, where $T(s,a,s')$ denotes the transition probabilities $Pr(s'|s,a)$; an observation function $Z$, where $Z(s,o)$ denotes the probability $Pr(o|s,a)$ of making observation $o$ in state $s$ after taking action $a$; and a reward function $R$, where $R(s,a)$ denotes the immediate reward associated with state $s$ when executing action $a$. We assume discrete state, action and observation sets and we focus on discounted, infinite horizon POMDPs with discount factor $0 ≤ γ < 1$. Since states are not directly observable in POMDPs, we define a belief state $b(s) = Pr(s)$ to be a distribution over states. Belief state $b$ can be updated in response to an action-observation pair $⟨a,o⟩$ using Bayes rule.\n\nPolicies represented by FSCs are defined by a (possibly cyclic) directed graph $π = ⟨N,E⟩$, where each node $n ∈ N$ is labeled by an action $a$ and each edge $e ∈ E$ by an observation $o$. Each node has one outward edge per observation. The FSC can be viewed as a policy $π = ⟨α,β⟩$, where action strategy $α$ associates each node $n$ with an action $α(n) ∈ A$, and observation strategy $β$ associates each node $n$ and observation $o$ with a successor node $β(n,o) ∈ N$ (corresponding to the edge from $n$ labeled with $o$). A policy is executed by taking the action associated with the \u201ccurrent node,\u201d and updating the current node by following the edge labeled by the observation made.\n\nThe value function $V^π$ of an FSC $π$ is the expected discounted sum of rewards for executing its policy $π$, and can be computed by solving a set of linear equations:\n\n$V^π(n,s) = R(s,α(n)) + γ Σ_{s',o} Pr(s'|s,α(n)) Pr(o|s',α(n)) V^π(β(n,o),s')$    (1)\n\nGiven an initial belief state $b$, an FSC\u2019s value at node $n$ is simply the expectation $V(n,b) = Σ_s b(s) V(n,s)$; the best starting node for a given $b$ is determined by $V(b) = max_n V(n,b)$. As a result, the value $V(n,b)$ of each node $n$ is linear with respect to the belief state; hence the value function of the controller is piecewise-linear and convex. In Fig. 1(a), each linear segment corresponds to the value function of a node and the upper surface of these segments forms the controller value function. The optimal value function $V^*$ satisfies Bellman\u2019s equation:\n\n$V^*(b) = max_a [ R(b,a) + γ Σ_o Pr(o|b,a) V^*(b^a_o) ]$    (2)\n\nPolicy iteration (PI) [7] incrementally improves a controller by alternating between two steps, policy improvement and policy evaluation, until convergence to an optimal policy. Policy evaluation solves Eq. 1 for a given policy. Policy improvement adds nodes to the controller by dynamic programming (DP) and removes other nodes. 
A DP backup applies the r.h.s. of Eq. 2 to the value function ($V$ in Fig. 2(a)) of the current controller to obtain a new, improved value function ($V'$ in Fig. 2(a)). Each linear segment of $V'$ corresponds to a new node added to the controller. Several algorithms can be used to perform DP backups, with incremental pruning [4] perhaps being the fastest. After the new nodes created by DP have been added, old nodes that are now pointwise dominated are removed. A node is pointwise dominated when its value is less than that of some other node at all belief states (e.g., $n_1$ is pointwise dominated by $n_4$ in Fig. 2(a)). The inward edges of a pointwise dominated node are re-directed to the dominating node since it offers better value (e.g., inward arcs of $n_1$ are redirected to $n_4$ in Fig. 2(c)). The controller resulting from this policy improvement step is guaranteed to offer higher value at all belief states. On the other hand, up to $|A||N|^{|O|}$ new nodes may be added with each DP backup, so the size of the controller quickly becomes intractable in many POMDPs.\n\nFigure 1: a) Value function example; b) BPI local optimum: each linear segment of the value function is tangent to the backed up value function. [Plots of value against belief space; legend entries: upper surface, convex combination, value function, backed up value function.]\n\nFigure 2: a) Value function $V$ and the backed-up $V'$ obtained by DP; b) original controller ($n_1$ and $n_2$) with nodes added ($n_3$ and $n_4$) by DP; c) new controller once pointwise dominated node $n_1$ is removed and its inward arcs a, b, c are redirected to $n_4$. [Panel a) plots value against belief space; panels b) and c) are controller graphs.]\n\n3 Policy Iteration for Stochastic Controllers\n\nPolicy iteration only prunes nodes that are pointwise dominated, rather than all dominated nodes. This is because the algorithm is designed to produce controllers with deterministic observation strategies. A pointwise-dominated node can safely be pruned since its inward arcs are redirected to the dominating node (which has value at least as high as the dominated node at each state). In contrast, a node jointly dominated by several nodes (e.g., $n_2$ in Fig. 2(b) is jointly dominated by $n_3$ and $n_4$) cannot be pruned without its inward arcs being redirected to different nodes depending on the current belief state.\n\nThis problem can be circumvented by allowing stochastic observation strategies. We revise the notion of observation strategy $β(n,o,n') = Pr(n'|n,o)$, defining a distribution over successor nodes $n'$ for each $n,o$-pair. If the stochastic strategy is chosen carefully, the corresponding convex combination of dominating nodes may pointwise dominate the node we would like to prune. In Fig. 1(a), $n_1$ is dominated by $n_2$ and $n_3$ together (but neither of them alone). Convex combinations of $n_2$ and $n_3$ correspond to all lines that pass through the intersection of $n_2$ and $n_3$. The dotted line illustrates one convex combination of $n_2$ and $n_3$ that pointwise dominates $n_1$: consequently, $n_1$ can be safely removed and its inward arcs re-directed to reflect this convex combination by setting the observation probabilities accordingly. 
In general, when a node is jointly dominated by a group of nodes, there exists a pointwise-dominating convex combination of this group.\n\nTheorem 1 The value function $V(n,·)$ of a node $n$ is jointly dominated by the value functions $V(n_1,·), …, V(n_k,·)$ of nodes $n_1, …, n_k$ if and only if there is a convex combination $Σ_i c_i V(n_i,·)$ that dominates $V(n,·)$.\n\nTable 1: Primal LP: $V(n,·)$ is jointly dominated by $V(n_1,·), …, V(n_k,·)$ when $ε ≥ 0$.\n$min ε$\ns.t. $Σ_s b(s) V(n,s) + ε ≥ Σ_s b(s) V(n_i,s)$, $∀i$\n$Σ_s b(s) = 1$\n$b(s) ≥ 0$, $∀s$\n\nTable 2: Dual LP: convex combination $Σ_i c_i V(n_i,·)$ dominates $V(n,·)$ when $ε ≥ 0$.\n$max ε$\ns.t. $V(n,s) + ε ≤ Σ_i c_i V(n_i,s)$, $∀s$\n$Σ_i c_i = 1$\n$c_i ≥ 0$, $∀i$\n\nProof: $V(n,·)$ is dominated by $V(n_1,·), …, V(n_k,·)$ when the objective of the LP in Table 1 is positive. This LP finds the belief state $b$ that minimizes the difference between $V(n,b)$ and the max of $V(n_1,b), …, V(n_k,b)$. It turns out that the dual LP (Table 2) finds the most dominating convex combination parallel to $V(n,·)$. Since the dual has positive objective value when the primal does, the theorem follows. □\n\nAs argued in the proof of Thm. 1, the LP in Table 1 gives us an algorithm to find the most dominating convex combination parallel to a dominated node. In summary, by considering stochastic controllers, we can extend PI to prune all dominated nodes (pointwise or jointly) in the policy improvement step. This provides two advantages: controllers can be made smaller while improving their decision quality.\n\n4 Bounded Policy Iteration\n\nAlthough pruning all dominated nodes helps to keep the controller small, it may still grow substantially with each DP backup. Several heuristics are possible to bound the number of nodes. Feng and Hansen [6] proposed that one prunes all nodes that dominate the value function by less than some $ε$ after each DP backup. Alternatively, instead of growing the controller with each backup and then pruning, we can do a partial DP backup that generates only a subset of the nodes using Cheng\u2019s algorithm [5], the witness algorithm [9], or other heuristics [14]. In order to keep the controller bounded, for each node created in a partial DP backup, one node must be pruned and its inward arcs redirected to some dominating convex combination. In the event where no node is dominated, we can still prune a node and redirect its arcs to a good convex combination, but the resulting controller may have lesser value at some belief states. We now propose a new algorithm called bounded policy iteration (BPI) that guarantees monotonic value improvement at all belief states while keeping the number of nodes fixed.\n\nBPI considers one node at a time and tries to improve it while keeping all other nodes fixed. 
Improvement is achieved by replacing each node by a good convex combination of the nodes normally created by a DP backup, but without actually performing a backup. Since the backed up value function must dominate the controller\u2019s current value function, then by Thm. 1 there must exist a convex combination of the backed up nodes that pointwise dominates each node of the controller. Combining this idea with Eq. 2, we can directly compute such convex combinations with the LP in Table 3. This LP has $|A||N|^{|O|}$ variables corresponding to the probabilities of the convex combination as well as the $ε$ variable measuring the value improvement. We can significantly reduce the number of variables by pushing the convex combination variables as far as possible into the DP backup, resulting in the LP shown in Table 4. The key here is to realize that we can aggregate many variables since we only care about the marginals $c_a = Σ_{n_{o_1},…,n_{o_{|O|}}} c_{a,n_{o_1},…,n_{o_{|O|}}}$ and $c_{a,n_o}$, the latter obtained by summing $c_{a,n_{o_1},…,n_{o_{|O|}}}$ over the successor nodes of every observation other than $o$.\n\nTable 3: Naive LP to find a convex combination of backed up nodes that dominate $n$.\n$max ε$\ns.t. $V(n,s) + ε ≤ Σ_{a,n_{o_1},…,n_{o_{|O|}}} c_{a,n_{o_1},…,n_{o_{|O|}}} [ R(s,a) + γ Σ_{s',o} Pr(s'|s,a) Pr(o|s',a) V(n_o,s') ]$, $∀s$\n$Σ_{a,n_{o_1},…,n_{o_{|O|}}} c_{a,n_{o_1},…,n_{o_{|O|}}} = 1$\n$c_{a,n_{o_1},…,n_{o_{|O|}}} ≥ 0$, $∀a, n_{o_1}, …, n_{o_{|O|}}$\n\nTable 4: Efficient LP to find a convex combination of backed up nodes that dominate $n$.\n$max ε$\ns.t. $V(n,s) + ε ≤ Σ_a [ c_a R(s,a) + γ Σ_{s',o} Pr(s'|s,a) Pr(o|s',a) Σ_{n_o} c_{a,n_o} V(n_o,s') ]$, $∀s$\n$Σ_a c_a = 1$\n$Σ_{n_o} c_{a,n_o} = c_a$, $∀a,o$\n$c_a ≥ 0$, $∀a$\n$c_{a,n_o} ≥ 0$, $∀a, n_o$\n\nThe efficient LP in Table 4 has only $|O||A||N| + |A| + 1$ variables (footnote 1). Furthermore, the variables $c_a$ and $c_{a,n_o}$ have an intuitive interpretation w.r.t. the action and observation strategies for the improved node. Each $c_a$ variable indicates the probability of executing action $a$ (i.e., $α(n,a) = c_a$). Similarly, each $c_{a,n_o}$ variable indicates the (unnormalized) probability of reaching node $n_o$ after executing $a$ and observing $o$ (i.e., $β(n,a,o,n_o) = c_{a,n_o}/c_a$). Note that we now use probabilistic action strategies and have extended probabilistic observation strategies to depend on the action executed.\n\nFootnote 1: Actually, we don\u2019t need the $c_a$ variables since they can be derived from the $c_{a,n_o}$ variables by summing out $n_o$, so the number of variables can be reduced to $|A||O||N| + 1$.\n\nTo summarize, BPI alternates between policy evaluation and improvement as in regular PI, but the policy improvement step simply tries to improve each node by solving the LP in Table 4. The $c_a$ and $c_{a,n_o}$ variables are used to set the probabilistic action and observation strategies of the new improved node.\n\n5 Local Optima\n\nBPI is a simple, efficient alternative to standard PI that monotonically improves an FSC while keeping its size constant. Unfortunately, it is only guaranteed to converge to a local optimum. We now characterize BPI\u2019s local optima and propose a method to escape them.\n\n5.1 Characterization\n\nThm. 2 gives a necessary and sufficient condition characterizing BPI\u2019s local optima. Intuitively, a controller is a local optimum when each linear segment touches from below, or is tangent to, the controller\u2019s backed up value function (see Fig. 1(b)).\n\nTheorem 2 BPI has converged to a local optimum if and only if each node\u2019s value function is tangent to the backed up value function.\n\nProof: Since the objective function of the LP in Table 4 seeks to maximize the improvement $ε$, the resulting convex combination must be tangent to the upper surface of the backed up value function. Conversely, the only time when the LP won\u2019t be able to improve a node is when its vector is already tangent to the backed up value function. □\n\nInterestingly, tangency is a necessary (but not sufficient) condition for GA\u2019s local optima.\n\nCorollary 1 If GA has converged to a local optimum, then the value function of each node reachable from the initial belief state is tangent to the backed up value function.\n\nProof: GA seeks to monotonically improve a controller in the direction of steepest ascent. The LP of Table 4 also seeks a monotonically improving direction. 
Thus if BPI can improve a controller by finding a direction of improvement using the LP of Table 4, then GA will also find it or will find a steeper one. Conversely, when a controller is a local optimum for GA, then there is no monotonic improvement possible in any direction. Since BPI can only improve a controller by following a direction of monotonic improvement, GA\u2019s local optima are a subset of BPI\u2019s local optima. Thus, tangency is a necessary, but not sufficient, condition of GA\u2019s local optima. □\n\nIn the proof of Corollary 1, we argued that GA\u2019s local optima are a subset of BPI\u2019s local optima. This suggests that BPI is inferior to GA since it can be trapped by more local optima than GA. However we will describe in the next section a simple technique that allows BPI to easily escape from local optima.\n\n5.2 Escape Technique\n\nThe tangency condition characterizing local optima can be used to design an effective escape method for BPI. It essentially tells us that such tangent belief states are \u201cbottlenecks\u201d for further policy improvement. If we could improve the value at the tangent belief state(s) of some node, then we could break out of the local optimum. A simple method for doing so consists of a one-step lookahead search from the tangent belief states. Figure 1(b) illustrates how belief state $b'$ can be reached in one step from tangent belief state $b$, and how the backed up value function improves $b'$\u2019s current value. Thus, if we add a node to the controller that maximizes the value of $b'$, its improved value can subsequently be backed up to the tangent belief state $b$, breaking out of the local optimum.\n\nOur algorithm is summarized as follows: perform a one-step lookahead search from each tangent belief state; when a reachable belief state can be improved, add a new node to the controller that maximizes that belief state\u2019s value. 
Interestingly, when no reachable belief state can be improved, the policy must be optimal at the tangent belief states.\n\nTheorem 3 If the backed up value function does not improve the value of any belief state reachable in one step from any tangent belief state, then the policy is optimal at the tangent belief states.\n\nProof: By definition, belief states for which the backed up value function provides no improvement are tangent belief states. Hence, when all belief states reachable in one step are themselves tangent belief states, then the set of tangent belief states is closed under every policy. Since there is no possibility of improvement, the current policy must be optimal at the tangent belief states. □\n\nAlthough Thm. 3 guarantees an optimal solution only at the tangent belief states, in practice, they rarely form a proper subset of the belief space (when none of the reachable belief states can be improved). Note also that the escape algorithm assumes knowledge of the tangent belief states. Fortunately, the solution to the dual of the LP in Table 4 is a tangent belief state. Since most commercial LP solvers return both the solution of the primal and dual, a tangent belief state is readily available for each node. 
(footnote 2)\n\nFootnote 2: A node may have more than one tangent belief state when an interval of its linear segment is tangent to the backed up value function, indicating that it is identical to some backed up node.\n\nFigure 3: Experimental results for the maze and tag-avoid problems. [Panels for Maze400 and Tag-Avoid plot expected rewards against number of nodes and against time in seconds.]\n\n6 Experiments\n\nWe report some preliminary experiments with BPI and the escape method to assess their robustness against local optima, as well as their scalability to relatively large POMDPs. In a first experiment, we ran BPI with escape on a preference elicitation problem and a modified version of the Heaven-and-Hell problem described in [3]. It consistently found the optimal policy, whereas GA settles for a local optimum for both problems.\n\nIn a second experiment, we report the running time and decision quality of the controllers found for two large grid-world problems. The first is a 400-state extension of Hauskrecht\u2019s [8] 20-state maze problem, and the second Pineau et al.\u2019s [12] 870-state tag-avoid problem. In Figure 3, we report the expected return achieved w.r.t. time and number of nodes. For the maze problem, the expected return is averaged over all 400 states since BPI tries to optimize the policy for all belief states simultaneously. For comparison purposes, the expected return for the tag-avoid problem is measured at the same initial belief state used in [12] even though BPI doesn\u2019t tailor its policy exclusively to that belief state. In contrast, many point-based algorithms including PBVI [12] (which is perhaps the best such algorithm) optimize the policy for a single initial belief state, capitalizing on a hopefully small reachable belief region. BPI found a [illegible]-node controller in [illegible] with the same expected return of [illegible] achieved by PBVI in 180880s with a policy of [illegible] linear segments. This suggests that most of the belief space is reachable in tag-avoid. We also ran BPI on the tiger-grid, hallway and hallway2 benchmark problems [12] and obtained [illegible]-node controllers in [illegible] achieving expected returns of [illegible] at the same initial belief states used in [12], but without using them to tailor the policy. In contrast, PBVI achieved expected returns of [illegible] in [illegible] with policies of [illegible] linear segments tailored to those initial belief states. This suggests that only a small portion of the belief space is reachable.\n\n7 Conclusion\n\nWe have introduced the BPI algorithm, which guarantees monotonic improvement of the value function while keeping controller size fixed. While quite efficient, the algorithm may get trapped in local optima. An analysis of such local optima reveals that the value function of each node is tangent to the backed up value function. This property can be successfully exploited in an algorithm that escapes local optima quite robustly.\n\nThis research can be extended in a number of directions. State aggregation [2] and belief compression [13] techniques could be easily integrated with BPI to scale to problems with large state spaces. Also, since stochastic GA [11, 1] can tackle model free problems (which BPI cannot) it would be interesting to see if tangent belief states could be computed for stochastic GA and used to design a heuristic to escape local optima similar to the one proposed for BPI.\n\nAcknowledgements We thank Darius Braziunas for his help with the implementation and the anonymous reviewers for the helpful comments.\n\nReferences\n\n[1] D. Aberdeen and J. Baxter. Scaling internal-state policy-gradient methods for POMDPs. Proc. ICML-02, pp.3\u201310, Sydney, Australia, 2002.\n\n[2] C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact representations. Proc. AAAI-96, pp.1168\u20131175, Portland, OR, 1996.\n\n[3] D. Braziunas. Stochastic local search for POMDP controllers. Master\u2019s thesis, University of Toronto, Toronto, 2003.\n\n[4] A. R. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for POMDPs. Proc. UAI-97, pp.54\u201361, Providence, RI, 1997.\n\n[5] H.-T. Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia, Vancouver, 1988.\n\n[6] Z. 
Feng and E. A. Hansen. Approximate planning for factored POMDPs. Proc. ECP-01, Toledo, Spain, 2001.\n\n[7] E. A. Hansen. Solving POMDPs by searching in policy space. Proc. UAI-98, pp.211\u2013219, Madison, Wisconsin, 1998.\n\n[8] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33\u201394, 2000.\n\n[9] L. P. Kaelbling, M. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99\u2013134, 1998.\n\n[10] N. Meuleau, K.-E. Kim, L. P. Kaelbling, and A. R. Cassandra. Solving POMDPs by searching the space of finite policies. Proc. UAI-99, pp.417\u2013426, Stockholm, 1999.\n\n[11] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. Proc. UAI-99, pp.427\u2013436, Stockholm, 1999.\n\n[12] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. In Proc. IJCAI-03, Acapulco, Mexico, 2003.\n\n[13] P. Poupart and C. Boutilier. Value-directed compressions of POMDPs. Proc. NIPS-02, pp.1547\u20131554, Vancouver, Canada, 2002.\n\n[14] N. L. Zhang and W. Zhang. Speeding up the convergence of value-iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:29\u201351, 2001.", "award": [], "sourceid": 2372, "authors": [{"given_name": "Pascal", "family_name": "Poupart", "institution": null}, {"given_name": "Craig", "family_name": "Boutilier", "institution": null}]}