Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)
Michael Duff, Andrew Barto
In general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDP's) require a prohibitive amount of computation: the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain "local" sense, a given MDP. Bandit processes constitute an important subclass of MDP's, and have optimal learning strategies (defined in terms of Gittins indices) that can be computed relatively efficiently. Thus, one scheme for achieving approximately-optimal learning for general MDP's proceeds by taking actions suggested by strategies that are optimal with respect to local bandit models.
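As a concrete illustration of the kind of index computation the abstract refers to (and not the authors' own code), the following minimal Python sketch approximates the Gittins index of a Bernoulli arm with a Beta(a, b) posterior by the standard calibration method: find the known-rate "standard arm" payoff at which the learner is indifferent between retiring and continuing. The function name, truncation horizon, and bisection tolerance are illustrative assumptions.

def gittins_index_bernoulli(a, b, gamma=0.9, horizon=80, tol=1e-5):
    """Approximate the Gittins index of a Bernoulli arm with Beta(a, b)
    posterior (hypothetical helper, for illustration only).

    The index is the payoff rate lam of a known 'standard arm' at which
    retiring (earning lam forever, discounted) and continuing to pull the
    unknown arm have equal value. Computed by backward induction over
    posterior states, truncated at a finite horizon, plus bisection on lam.
    """
    def value_at_root(lam):
        retire = lam / (1.0 - gamma)  # value of retiring immediately
        V = {}
        # States (s, f): successes/failures observed beyond the prior (a, b).
        # Sweep depths d = s + f backward from the truncation horizon.
        for d in range(horizon, -1, -1):
            for s in range(d + 1):
                f = d - s
                p = (a + s) / (a + s + b + f)  # posterior mean of the arm
                if d == horizon:
                    # Truncation: beyond the horizon, freeze the posterior
                    # and choose between retiring and pulling forever.
                    V[(s, f)] = max(retire, p / (1.0 - gamma))
                else:
                    cont = p * (1.0 + gamma * V[(s + 1, f)]) \
                         + (1.0 - p) * gamma * V[(s, f + 1)]
                    V[(s, f)] = max(retire, cont)
        return V[(0, 0)]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if value_at_root(lam) > lam / (1.0 - gamma) + 1e-12:
            lo = lam  # continuing still beats retiring: index exceeds lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

For a uniform prior Beta(1, 1) and gamma = 0.9 this returns a value above the posterior mean 0.5, reflecting the exploration bonus the index assigns to an uncertain arm; the paper's proposal is to apply such index-based strategies to bandit processes that serve as local models of a general MDP.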