Part of Advances in Neural Information Processing Systems 6 (NIPS 1993)
Tommi Jaakkola, Michael I. Jordan, Satinder P. Singh
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong.
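As a concrete instance of the class of algorithms the abstract refers to, the standard Q-learning update can be sketched as follows. This is a minimal illustration, not code from the paper; the toy environment, state names, and step-size value are hypothetical.

```python
def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """One stochastic-approximation step of Q-learning:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict mapping state -> {action: value}."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    return Q

# Toy demo (hypothetical two-state chain): repeatedly taking action 'a'
# in state 's0' yields reward 1 and moves to an absorbing state 'done'
# whose values are never updated, so Q('s0','a') should approach 1.
Q = {'s0': {'a': 0.0}, 'done': {'a': 0.0}}
for _ in range(100):
    q_learning_update(Q, 's0', 'a', 1.0, 'done')
```

Under the conditions of the paper's convergence theorem (suitably decaying step sizes and sufficient exploration), iterates of this form converge to the optimal value function; the fixed step size above is only for illustration.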