Multidimensional Triangulation and Interpolation for Reinforcement Learning

Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)


Authors

Scott Davies

Abstract

Dynamic Programming, Q-learning and other discrete Markov Decision Process solvers can be applied to continuous d-dimensional state-spaces by quantizing the state space into an array of boxes. This is often problematic above two dimensions: a coarse quantization can lead to poor policies, and fine quantization is too expensive. Possible solutions are variable-resolution discretization, or function approximation by neural nets. A third option, which has been little studied in the reinforcement learning literature, is interpolation on a coarse grid. In this paper we study interpolation techniques that can result in vast improvements in the online behavior of the resulting control systems: multilinear interpolation, and an interpolation algorithm based on an interesting regular triangulation of d-dimensional space. We adapt these interpolators under three reinforcement learning paradigms: (i) offline value iteration with a known model, (ii) Q-learning, and (iii) online value iteration with a previously unknown model learned from data. We describe empirical results, and the resulting implications for practical learning of continuous non-linear dynamic control.

1 GRID-BASED INTERPOLATION TECHNIQUES

Reinforcement learning algorithms generate functions that map states to "cost-to-go" values. In a continuous state space these value functions must be represented approximately, and several representations are possible:

• Fine grids may be used in one or two dimensions, but above two dimensions they become too expensive. Moreover, value functions can be discontinuous, which (as we will see) can lead to suboptimalities even with very fine discretization in two dimensions.

• Neural nets have been used in conjunction with TD [Sutton, 1988] and Q-learning [Watkins, 1989] in very high dimensional spaces [Tesauro, 1991, Crites and Barto, 1996]. While promising, it is not always clear that they produce the accurate value functions that might be needed for fine near-optimal control of dynamic systems, and the most commonly used methods of applying value iteration or policy iteration with a neural-net value function are often unstable [Boyan and Moore, 1995].
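The two grid-based interpolators named in the abstract, multilinear interpolation and interpolation on a regular (Kuhn) triangulation of each grid cell, can be sketched compactly. The code below is a minimal illustrative sketch, not the authors' implementation; the function names (multilinear_interpolate, simplex_interpolate) and the unit-grid coordinate convention are assumptions made for the example.

```python
# Sketch of two grid-based interpolators over a regular unit grid of stored
# cost-to-go values: multilinear interpolation (2^d cell corners) and
# simplex interpolation via the Kuhn triangulation (d+1 vertices per query).
import itertools
import numpy as np


def multilinear_interpolate(values, point):
    """values: d-dimensional array of values at grid vertices.
    point: continuous state in grid coordinates (e.g. 2.3 lies between
    vertices 2 and 3 along that axis)."""
    base = np.floor(point).astype(int)   # "lower-left" corner of the cell
    frac = point - base                  # position within the cell, in [0, 1)^d
    d = len(point)
    total = 0.0
    # Weight each of the 2^d cell corners by a product of per-axis factors.
    for corner in itertools.product((0, 1), repeat=d):
        weight = np.prod([f if c else 1.0 - f for c, f in zip(corner, frac)])
        total += weight * values[tuple(base + np.array(corner))]
    return total


def simplex_interpolate(values, point):
    """Interpolate using the Kuhn triangulation of the enclosing cell: walk
    from the cell's base corner, incrementing coordinates in order of
    decreasing fractional part; barycentric weights are differences of the
    sorted fractional parts."""
    base = np.floor(point).astype(int)
    frac = point - base
    d = len(point)
    order = np.argsort(-frac)            # axes sorted by decreasing fraction
    sorted_frac = np.concatenate(([1.0], frac[order], [0.0]))
    vertex = base.copy()
    total = 0.0
    for i in range(d + 1):
        weight = sorted_frac[i] - sorted_frac[i + 1]   # barycentric coordinate
        total += weight * values[tuple(vertex)]
        if i < d:
            vertex[order[i]] += 1        # move to the next simplex vertex
    return total


if __name__ == "__main__":
    # Toy 2-D value function on a 4x4 grid; both interpolators agree with the
    # stored values at the vertices and differ only in the cell interiors.
    grid = np.fromfunction(lambda i, j: i ** 2 + j, (4, 4))
    query = np.array([1.3, 2.6])
    print(multilinear_interpolate(grid, query))
    print(simplex_interpolate(grid, query))
```

The practical distinction is cost: multilinear interpolation consults all 2^d corners of the enclosing cell, while the triangulation-based scheme consults only d+1 vertices, which matters as the dimensionality grows.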