Journal of Machine Learning Research 3 (2002) 145-174    Submitted 10/01; Revised 1/02; Published 8/02

ε-MDPs: Learning in Varying Environments
István Szita szityu@cs.elte.hu
Bálint Takács deim@inf.elte.hu
András Lőrincz lorincz@inf.elte.hu
Department of Information Systems, Eötvös Loránd University
Pázmány Péter sétány 1/C
Budapest, Hungary H-1117
Editor: Sridhar Mahadevan
In this paper, ε-MDP models are introduced and convergence theorems are proven using the generalized MDP framework of Szepesvári and Littman. Using this model family, we show that Q-learning is capable of finding near-optimal policies in varying environments. The potential of this new family of MDP models is illustrated via a reinforcement learning algorithm called event-learning, which separates the optimization of decision making from the controller. We show that event-learning augmented by a particular controller, which gives rise to an ε-MDP, enables near-optimal performance even when considerable and sudden changes occur in the environment. Illustrations are provided on the two-segment pendulum problem.
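As a rough illustrative sketch of the setting summarized above (not taken from the paper), the following Python snippet runs ordinary tabular Q-learning on a toy chain MDP whose transition probabilities are perturbed by a small amount eps at every step, mimicking a varying environment; all names, parameters, and the toy dynamics are our own assumptions.

import numpy as np

# Illustrative sketch only: tabular Q-learning on a toy 5-state chain whose
# dynamics drift by at most eps per step. All constants below are assumptions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.05

def step(s, a):
    drift = rng.uniform(-eps, eps)                 # bounded perturbation of the dynamics
    p_right = (0.9 if a == 1 else 0.1) + drift     # action 1 tends to move right, action 0 left
    s_next = min(s + 1, n_states - 1) if rng.random() < p_right else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

Q = np.zeros((n_states, n_actions))
s = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Standard Q-learning update; with eps-bounded perturbations the learned
    # policy can only be expected to be near-optimal (up to an eps-dependent term).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next if s_next != n_states - 1 else 0    # restart after reaching the goal state

print(np.round(Q, 2))

The sketch is meant only to make the "near-optimal despite perturbed dynamics" claim concrete; the paper's actual constructions (the ε-MDP model family, event-learning, and the controller used in the experiments) are developed in the sections that follow.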