Pascal Poupart, Craig Boutilier
Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique with Bounded Policy Iteration (BPI). The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.
Partially observable Markov decision processes (POMDPs) provide a natural and expressive framework for decision making, but their use in practice has been limited by the lack of scalable solution algorithms. Two important sources of intractability plague discrete model-based POMDPs: high dimensionality of belief space, and the complexity of policy or value function (VF) space. Classic solution algorithms [4, 10, 7], for example, compute value functions represented by exponentially many value vectors, each of exponential size. As a result, they can only solve POMDPs with on the order of 100 states. Consequently, much research has been devoted to mitigating these two sources of intractability.
The complexity of policy/VF space has been addressed by observing that there are often very good policies whose value functions are representable by a small number of vectors. Various algorithms such as approximate vector pruning, point-based value iteration (PBVI) [12, 16], bounded policy iteration (BPI), gradient ascent (GA) [11, 1] and stochastic local search (SLS) exploit this fact to produce (often near-optimal) policies of low complexity (i.e., few vectors), allowing larger POMDPs to be solved. Still these scale to problems of only roughly 1000 states, since each value vector may still have exponential dimensionality. Conversely, it has been observed that belief states often carry more information than necessary. Hence, one can often reduce vector dimensionality by using compact representations such as decision trees (DTs), algebraic decision diagrams (ADDs) [8, 9], or linear combinations of small basis functions (LCBFs), or by indirectly compressing the belief space into a small subspace by a value-directed compression (VDC) or exponential PCA. Once compressed, classic solution methods can be used. However, since none of these approaches address the exponential complexity of policy/VF space, they can only solve slightly larger POMDPs (up to 8250 states).
Scalable POMDP algorithms can only be realized when both sources of intractability are tackled simultaneously. While Hansen and Feng implemented such an algorithm by combining approximate state abstraction with approximate vector pruning, they didn't demonstrate the scalability of the approach on large problems. In this paper, we describe how to combine value directed compression (VDC) with bounded policy iteration (BPI) and demonstrate the scalability of the resulting algorithm (VDCBPI) on synthetic network management problems of up to 33 million states. Among the techniques that deal with the curse of dimensionality, VDC offers the advantage that the compressed POMDP can be directly fed into existing POMDP algorithms with no (or only slight) adjustments. This is not the case for exponential-PCA, nor compact representations (DTs, ADDs, LCBFs). Among algorithms that mitigate policy space complexity, BPI distinguishes itself by its ability to avoid local optima (cf. GA), its efficiency (cf. SLS) and the fact that belief state monitoring is not required (cf. PBVI, approximate vector pruning). Beyond the combination of VDC with BPI, we offer two other contributions. We propose a new simple heuristic to compute good lossy value directed compressions. We also augment BPI with the ability to bias its policy search to reachable belief states. As a result, BPI can often find a much smaller policy of similar quality for a given initial belief state.
2 POMDP Background
A POMDP is defined by: states S; actions A; observations Z; transition function T, where T(s, a, s') denotes Pr(s'|s, a); observation function Z, where Z(a, s', z) is the probability Pr(z|s', a) of observation z in state s' after executing a; and reward function R, where R(s, a) is the immediate reward associated with s when executing a. We assume discrete state, action and observation sets and focus on discounted, infinite horizon POMDPs with discount factor 0 ≤ γ < 1.
Policies and value functions for POMDPs are typically defined over belief space B, where a belief state b is a distribution over S capturing an agent's knowledge about the current state of the world. Belief state b can be updated in response to a specific action-observation pair a, z using Bayes rule. We denote the (unnormalized) belief update mapping by T^{a,z}, where T^{a,z}_{ij} = Pr(s_j|a, s_i) Pr(z|s_j). A factored POMDP, with exponentially many states, thus gives rise to a belief space of exponential dimensionality.
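To make the update concrete, the following sketch implements the Bayes-rule belief update for a flat discrete POMDP. The array layout (T[a][s, s'] = Pr(s'|s, a), Z[a][s', z] = Pr(z|s', a)) is our own convention for illustration, not notation from the paper:

```python
import numpy as np

def belief_update(b, T, Z, a, z):
    """Bayes-rule belief update for a discrete POMDP.

    b: belief over states, shape (|S|,)
    T: dict of transition matrices, T[a][s, s'] = Pr(s'|s, a)
    Z: dict of observation matrices, Z[a][s', z] = Pr(z|s', a)
    Returns the normalized posterior belief after doing a and seeing z.
    """
    # Unnormalized update: b'(s') = sum_s b(s) Pr(s'|s, a) Pr(z|s', a)
    b_next = (b @ T[a]) * Z[a][:, z]
    return b_next / b_next.sum()  # denominator is Pr(z|b, a)
```

Dropping the normalization step gives the linear mapping T^{a,z} used in the text; keeping beliefs unnormalized is what makes the compression machinery of Section 4 purely linear.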
Policies represented by finite state controllers (FSCs) are defined by a (possibly cyclic) directed graph ⟨N, E⟩, where nodes n ∈ N correspond to stochastic action choices and edges e ∈ E to stochastic transitions. An FSC can be viewed as a policy π = ⟨ψ, η⟩, where action strategy ψ associates each node n with a distribution over actions ψ(n) = Pr(a|n), and observation strategy η associates each node n and observation z with a distribution over successor nodes η(n, z) = Pr(n'|n, z) (corresponding to the edge from n labeled with z). The value function V of FSC π is given by:
V(n, s) = Σ_a Pr(a|n) [ R(s, a) + γ Σ_{s'} Pr(s'|s, a) Σ_z Pr(z|s', a) Σ_{n'} Pr(n'|n, z) V(n', s') ]    (1)
The value V(n, b) of each node n is thus linear w.r.t. the belief state; hence the value function of the controller is piecewise-linear and convex. The optimal value function V* often has a large (if not infinite) number of vectors, each corresponding to a different node, and satisfies Bellman's equation:
V*(b) = max_a [ R(b, a) + γ Σ_z Pr(z|b, a) V*(b^{a,z}) ]    (2)

max ε
s.t.  V(n, s) + ε ≤ Σ_a [ Pr(a|n) R(s, a) + γ Σ_{s',z} Pr(s'|s, a) Pr(z|s', a) Σ_{n'} Pr(a, n'|n, z) V(n', s') ],  ∀ s
      Σ_a Pr(a|n) = 1;  Σ_{n'} Pr(a, n'|n, z) = Pr(a|n), ∀ a, z
      Pr(a|n) ≥ 0, ∀ a;  Pr(a, n'|n, z) ≥ 0, ∀ a, z

Table 1: LP to uniformly improve the value function of a node.

max Σ_{s,n} o(s, n) ε_{s,n}
s.t.  V(n, s) + ε_{s,n} ≤ Σ_a [ Pr(a|n) R(s, a) + γ Σ_{s',z} Pr(s'|s, a) Pr(z|s', a) Σ_{n'} Pr(a, n'|n, z) V(n', s') ],  ∀ s
      Σ_a Pr(a|n) = 1;  Σ_{n'} Pr(a, n'|n, z) = Pr(a|n), ∀ a, z
      Pr(a|n) ≥ 0, ∀ a;  Pr(a, n'|n, z) ≥ 0, ∀ a, z
Table 2: LP to improve the value function of a node in a non-uniform way according to the steady state occupancy o(s, n).
3 Bounded Policy Iteration
We briefly review the bounded policy iteration (BPI) algorithm and describe a simple extension to bias its search toward reachable belief states. BPI incrementally constructs an FSC by alternating policy improvement and policy evaluation. Unlike policy iteration, this is done by slowly increasing the number of nodes (and value vectors). The policy improvement step greedily improves each node n by optimizing its action and observation strategies, solving the linear program (LP) in Table 1. This LP uniformly maximizes the improvement in the value function by optimizing n's distributions Pr(a, n'|n, z). The policy evaluation step computes the value function of the current controller by solving Eq. 1. The algorithm monotonically improves the policy until convergence to a local optimum, at which point new nodes are introduced to escape the local optimum. BPI is guaranteed to converge to a policy that is optimal at the "tangent" belief states while slowly growing the size of the controller.
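Since Eq. 1 is linear in the values V(n, s), the policy evaluation step can be sketched as solving a linear system over the joint (node, state) space. The data layout below (psi[n, a] = Pr(a|n), eta[n, z, n'] = Pr(n'|n, z)) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def evaluate_fsc(psi, eta, T, Z, R, gamma):
    """Evaluate an FSC by solving Eq. 1 as the linear system (I - gamma M) V = r.

    psi[n, a]     : Pr(a|n), action strategy
    eta[n, z, n'] : Pr(n'|n, z), observation strategy
    T[a][s, s']   : Pr(s'|s, a);  Z[a][s', z] : Pr(z|s', a)
    R[s, a]       : immediate reward
    Returns V with V[n, s].
    """
    N, A = psi.shape
    S = R.shape[0]
    M = np.zeros((N * S, N * S))  # joint (node, state) transition operator
    r = np.zeros(N * S)           # expected immediate reward per (node, state)
    for n in range(N):
        for a in range(A):
            r[n*S:(n+1)*S] += psi[n, a] * R[:, a]
            for z in range(Z[a].shape[1]):
                # Pr(s'|s, a) Pr(z|s', a), as a matrix over (s, s')
                P = T[a] * Z[a][:, z]
                for n2 in range(N):
                    M[n*S:(n+1)*S, n2*S:(n2+1)*S] += psi[n, a] * eta[n, z, n2] * P
    V = np.linalg.solve(np.eye(N * S) - gamma * M, r)
    return V.reshape(N, S)
```

This direct solve is appropriate for the uncompressed model; Section 5 explains why successive approximation is preferable once lossy compressed dynamics are substituted for T and Z.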
In practice, we often wish to find a policy suitable for a given initial belief state. Since often only a small subset of belief space is reachable, it is generally possible to construct much smaller policies tailored to the reachable region. We now describe a simple way to bias BPI's efforts toward the reachable region. Recall that the LP in Table 1 optimizes the parameters of a node to uniformly improve its value at all belief states. We propose a new LP (Table 2) that weights the improvement by the (unnormalized) discounted occupancy distribution induced by the current policy, thereby aggregating the belief states reachable at each node. The (unnormalized) discounted occupancy distribution is given by:
o(s', n') = b0(s', n') + γ Σ_{s,a,z,n} o(s, n) Pr(a|n) Pr(s'|s, a) Pr(z|a, s') Pr(n'|n, z),   ∀ s', n'
The LP in Table 2 is obtained by introducing a variable ε_{s,n} for each s, replacing the objective by Σ_{s,n} o(s, n) ε_{s,n} and replacing ε in each constraint by the corresponding ε_{s,n}. When using the modified LP, BPI naturally tries to improve the policy at the reachable belief states before the others. Since the modification ensures that the value function doesn't decrease at any belief state, focusing the efforts on reachable belief states won't decrease policy value at other belief states. Furthermore, though the policy is initially biased toward reachable states, BPI will eventually improve the policy for all belief states.
Figure 1: Functional flow of a POMDP (dotted arrows) and a compressed POMDP (solid arrows).
4 Value-Directed Compression
We briefly review the sufficient conditions for a lossless compression of POMDPs and describe a simple new algorithm to obtain good lossy compressions. Belief states constitute a sufficient statistic summarizing all information available to the decision maker (i.e., past actions and observations). However, as long as enough information is available to evaluate the value of each policy, one can still choose the best policy. Since belief states often contain information irrelevant to the estimation of future rewards, one can often compress belief states into some lower-dimensional representation. Let f be a compression function that maps each belief state b into some lower dimensional compressed belief state b~ (see Figure 1). Here b~ can be viewed as a bottleneck that filters the information contained in b before it is used to estimate future rewards. We desire a compression f such that b~ corresponds to the smallest statistic sufficient for accurately predicting the current reward r as well as the next compressed belief state b~' (since it captures all the information in b necessary to accurately predict subsequent rewards). Such a compression f exists if we can also find compressed transition dynamics T~^{a,z} and a compressed reward function R~ such that:

R = R~ ∘ f   and   f ∘ T^{a,z} = T~^{a,z} ∘ f,   ∀ a ∈ A, z ∈ Z    (3)
Given an f, R~ and T~^{a,z} satisfying Eq. 3, we can evaluate any policy using the compressed POMDP dynamics to obtain V~. Since V = V~ ∘ f, the compressed POMDP is equivalent to the original.
When restricting f to be linear (represented by matrix F), we can rewrite Eq. 3 as:

R = F R~   and   T^{a,z} F = F T~^{a,z},   ∀ a ∈ A, z ∈ Z    (4)
That is, the column space of F spans R and is invariant w.r.t. each T^{a,z}. Hence, the columns of the best linear lossless compression mapping F form a basis for the smallest invariant subspace (w.r.t. each T^{a,z}) that spans R, i.e., the Krylov subspace. We can find the columns of F by Krylov iteration: multiplying R by each T^{a,z} until the newly generated vectors are linear combinations of previous ones.¹ The dimensionality of the compressed space is equal to the number of columns of F, which is necessarily smaller than or equal to the dimensionality of the original belief space. Once F is found, we can compute R~ and each T~^{a,z} by solving the system in Eq. 4.
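A minimal version of this Krylov iteration, written for small dense matrices (the paper advocates compact representations such as ADDs for factored models), might look as follows. The eps residual threshold permits early stopping for lossy compressions; function and variable names are ours:

```python
import numpy as np

def krylov_basis(R, Ts, eps=1e-8, max_cols=None):
    """Basis for the smallest T^{a,z}-invariant subspace spanning R.

    R:  reward vector(s), shape (|S|,) or (|S|, k), seeding the subspace
    Ts: list of matrices T^{a,z} (unnormalized belief-update mappings)
    A candidate vector is retained only if its residual after projection
    onto the current basis exceeds eps; a tight eps gives a (near-)lossless
    compression, a loose eps or a max_cols cap gives a lossy one.
    Returns F with the retained vectors as columns.
    """
    R = np.atleast_2d(R.T).T          # ensure shape (|S|, k)
    basis = []                        # orthonormal copies, for residual tests
    cols = []                         # retained (un-orthonormalized) vectors
    queue = [R[:, j] for j in range(R.shape[1])]
    while queue and (max_cols is None or len(cols) < max_cols):
        v = queue.pop(0)
        r = v.copy()                  # residual of v w.r.t. span(basis)
        for q in basis:
            r = r - (q @ r) * q
        if np.linalg.norm(r) > eps:
            basis.append(r / np.linalg.norm(r))
            cols.append(v)
            for T in Ts:              # expand: push T^{a,z} v for every a, z
                queue.append(T @ v)
    return np.column_stack(cols)
```

For example, if every T^{a,z} is the identity the basis is just R itself, while dynamics that cycle the reward through the state space force more columns into F.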
Since linear lossless compressions are not always possible, we can extend the technique to find good lossy compressions with early stopping of the Krylov iteration. We retain only the vectors that are "far" from being linear combinations of prior vectors. For instance, if v is a linear combination of v_1, v_2, ..., v_n, then there are coefficients c_1, c_2, ..., c_n s.t. the error ||v − Σ_i c_i v_i||_2 is zero. Given a threshold ε or some upper bound k on the desired number of columns in F, we run Krylov iteration, retaining only the vectors with an error greater than ε, or the k vectors with largest error. When F is computed by approximate
¹For numerical stability, one must orthogonalize each vector before multiplying by T^{a,z}.
Krylov iteration, we cannot compute R~ and T~^{a,z} by solving the linear system in Eq. 4: due to the lossy nature of the compression, the system is overconstrained. But we can find suitable R~ and T~^{a,z} by computing a least-squares approximation, solving:

F^T R = F^T F R~   and   F^T T^{a,z} F = F^T F T~^{a,z},   ∀ a ∈ A, z ∈ Z
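Assuming dense matrices, these normal equations can be solved directly; the following is an illustrative sketch (function and variable names are ours):

```python
import numpy as np

def compress_dynamics(F, R, Ts):
    """Least-squares compressed reward and dynamics via the normal equations
    F^T R = F^T F R~  and  F^T T^{a,z} F = F^T F T~^{a,z}.

    F:  compression matrix whose columns span the (approximate) Krylov subspace
    R:  reward vector, shape (|S|,)
    Ts: list of belief-update matrices T^{a,z}
    """
    G = F.T @ F  # Gram matrix; invertible when F has independent columns
    R_tilde = np.linalg.solve(G, F.T @ R)
    Ts_tilde = [np.linalg.solve(G, F.T @ T @ F) for T in Ts]
    return R_tilde, Ts_tilde
```

When the compression is in fact lossless, this reduces to the exact solution of Eq. 4, since the residuals are then zero.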
While compression is required when the dimensionality of belief space is too large, unfortunately the columns of F have the same dimensionality. Factored POMDPs of exponential dimension can, however, admit practical Krylov iteration if carried out using a compact representation (e.g., DTs or ADDs) to efficiently compute F, R~ and each T~^{a,z}.
5 Bounded Policy Iteration with Value-Directed Compression
In principle, any POMDP algorithm can be used to solve the compressed POMDPs produced by VDC. If the compression is lossless and the POMDP algorithm exact, the computed policy will be optimal for the original POMDP. In practice, POMDP algorithms are usually approximate and lossless compressions are not always possible, so care must be taken to ensure numerical stability and a policy of high quality for the original POMDP. We now discuss some of the integration issues that arise when combining VDC with BPI.
Since V = F V~, maximizing the compressed value vector V~ of some node n automatically maximizes the value V of n w.r.t. the original POMDP when F is nonnegative; hence it is essential that F be nonnegative. Otherwise, the optimal policy of the compressed POMDP may not be optimal for the original POMDP. Fortunately, when R is nonnegative, F is guaranteed to be nonnegative by the nature of Krylov iteration. If some rewards are negative, we can add a sufficiently large constant to R to make it nonnegative without changing the decision problem.
Since most algorithms, including BPI, compute approximately optimal policies, it is also critical to normalize the columns of F. Suppose F has two columns f_1 and f_2 with L1-lengths 1 and 100, respectively. Since V = F V~ = v~_1 f_1 + v~_2 f_2, changes in v~_2 have a much greater impact on V than changes in v~_1. Such a difference in sensitivity may bias the search for a good policy to an undesirable region of the belief space, or may even cause the algorithm to return a policy that is far from optimal for the original POMDP despite the fact that it is ε-optimal for the compressed POMDP.
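One way to carry out this normalization while keeping Eq. 4 satisfied is to rescale each column of F and push the inverse scaling into the compressed model; the algebra below follows from substituting F D^{-1} for F in Eq. 4 (a sketch, with names of our own choosing):

```python
import numpy as np

def normalize_columns(F, R_tilde, Ts_tilde):
    """Rescale columns of F to unit L1 length, preserving V = F V~.

    With D = diag(L1 column norms) and F' = F D^{-1}, taking
    R~' = D R~ and T~' = D T~ D^{-1} keeps Eq. 4 satisfied:
    F' R~' = F R~ = R  and  T F' = F' T~'.
    """
    d = np.abs(F).sum(axis=0)          # L1 length of each column
    F_n = F / d                        # broadcasts D^{-1} over columns
    R_n = d * R_tilde
    Ts_n = [np.diag(d) @ T @ np.diag(1.0 / d) for T in Ts_tilde]
    return F_n, R_n, Ts_n
```

After this rescaling, all compressed value coordinates influence V with comparable sensitivity, which is the property BPI's approximate optimization relies on.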
We note that it is "safer" to evaluate policies iteratively by successive approximation rather than by solving the system in Eq. 1. The transition matrices T^{a,z} have eigenvalues with magnitude at most 1, but lossy compressed transition matrices T~^{a,z} are not guaranteed to share this property. Hence, solving the system in Eq. 1 may not correspond to policy evaluation, and successive approximation is the safer choice for lossy compressions.
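A sketch of this iterative evaluation (the operator M here stands for the compressed dynamics of the controller; the interface is our own):

```python
import numpy as np

def evaluate_iteratively(r, M, gamma, iters=200):
    """Policy evaluation by successive approximation: V <- r + gamma M V.

    r: (compressed) expected immediate reward vector, shape (d,)
    M: (compressed) transition operator of the controller, shape (d, d)
    A direct solve of (I - gamma M) V = r assumes gamma M is a contraction;
    with a lossy compression M may have eigenvalues of magnitude above 1,
    so a fixed number of discounted backups is the safer choice.
    """
    V = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        V = r + gamma * M @ V
    return V
```

Each iteration corresponds to extending the evaluation horizon by one step, so the discount factor keeps the iterates bounded even when the direct linear solve would be ill-posed.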
Finally, several algorithms, including BPI, compute witness belief states to verify the dominance of a value vector. Since the compressed belief space B~ is different from the original belief space B, this must be approached with care. B is a simplex corresponding to the convex hull of the state points. In contrast, since each row vector of F is the compressed version of some state point, B~ corresponds to the convex hull of the row vectors of F. When F is nonnegative, it is often possible to ignore this difference. For instance, when verifying the dominance of a value vector, if there is a compressed witness b~, there is always an uncompressed witness b, but not vice versa. This means that we can properly identify all dominating value vectors, but we may erroneously classify a dominated vector as dominating. In practice, this doesn't impact the correctness of algorithms such as policy iteration, bounded policy iteration, incremental pruning, or the witness algorithm, but it will slow them down since they won't be able to prune as many value vectors as they otherwise could.
[Figure 2 here: surface plots of expected reward as a function of the number of basis functions and the number of controller nodes, one panel per network configuration (cycle16, cycle19, cycle22, cycle25, 3legs16, 3legs19, 3legs22, 3legs25), plus a running-time panel for cycle25.]
Figure 2: Experimental results for cycle and 3legs network configurations of 16, 19, 22 and 25 machines. The bottom right graph shows the running time of BPI on compressed versions of a cycle network of 25 machines.
             3legs                           cycle
             16      19      22      25      16      19      22      25
VDCBPI       120.9   137.0   151.0   164.8   103.9   121.3   134.3   151.4
heuristic    100.6   118.3   138.3   152.3   102.5   117.9   130.2   152.3
doNothing     98.4   112.9   133.5   147.1    91.6   105.4   122.0   140.1
Table 3: Comparison of the best policies achieved by VDCBPI to the doNothing and heuristic policies.
The above tips work well when VDC is integrated with BPI. We believe they are sufficient to ensure proper integration of VDC with other POMDP algorithms, though we haven't verified this empirically.