Part of Advances in Neural Information Processing Systems 10 (NIPS 1997)

*Rémi Munos, Paul Bourgine*

This paper is concerned with the problem of Reinforcement Learn(cid:173) ing (RL) for continuous state space and time stocha.stic control problems. We state the Harnilton-Jacobi-Bellman equation satis(cid:173) fied by the value function and use a Finite-Difference method for designing a convergent approximation scheme. Then we propose a RL algorithm based on this scheme and prove its convergence to the optimal solution.

1

Introduction to RL in the continuous, stochastic case

The objective of RL is to find -thanks to a reinforcement signal- an optimal strategy for solving a dynamical control problem. Here we sudy the continuous time, con(cid:173) tinuous state-space stochastic case, which covers a wide variety of control problems including target, viability, optimization problems (see [FS93], [KP95])}or which a formalism is the following. The evolution of the current state x(t) E 0 (the state(cid:173) space, with 0 open subset of IRd ), depends on the control u(t) E U (compact subset) by a stochastic differential equation, called the state dynamics:

dx = f(x(t), u(t))dt + a(x(t), u(t))dw

(1) where f is the local drift and a .dw (with w a brownian motion of dimension rand (j a d x r-matrix) the stochastic part (which appears for several reasons such as lake of precision, noisy influence, random fluctuations) of the diffusion process. For initial state x and control u(t), (1) leads to an infinity of possible traj~tories x(t). For some trajectory x(t) (see figure I)., let T be its exit time from 0 (with the convention that if x(t) always stays in 0, then T = 00). Then, we define the functional J of initial state x and control u(.) as the expectation for all trajectories of the discounted cumulative reinforcement :

J(x; u(.)) = Ex,u( .) {loT '/r(x(t), u(t))dt +,,{ R(X(T))}

Do not remove: This comment is monitored to verify that the site is working properly