Part of Advances in Neural Information Processing Systems 6 (NIPS 1993)
Tommi Jaakkola, Michael Jordan, Satinder Singh
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learn(cid:173) ing problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the pow(cid:173) erful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(") and Q-Iearning belong.