Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)
Stephan Pareigis
Reinforcement learning methods for discrete and semi-Markov de(cid:173) cision problems such as Real-Time Dynamic Programming can be generalized for Controlled Diffusion Processes. The optimal control problem reduces to a boundary value problem for a fully nonlinear second-order elliptic differential equation of Hamilton(cid:173) Jacobi-Bellman (HJB-) type. Numerical analysis provides multi(cid:173) grid methods for this kind of equation. In the case of Learning Con(cid:173) trol, however, the systems of equations on the various grid-levels are obtained using observed information (transitions and local cost). To ensure consistency, special attention needs to be directed to(cid:173) ward the type of time and space discretization during the obser(cid:173) vation. An algorithm for multi-grid observation is proposed. The multi-grid algorithm is demonstrated on a simple queuing problem.