In a common formulation of the reinforcement learning (RL) problem, an agent improves its behavior by observing the outcomes of its own interactions with the environment. In the 1980s, Markovian decision problems (MDPs) were proposed as a model for the analysis of RL (for an overview, see [Sutton and Barto(1998)] and references therein), and since then a mathematically well-founded theory has been constructed for a large class of RL algorithms [Watkins(1989), Watkins and Dayan(1992), Tsitsiklis(1994), Gullapalli and Barto(1994), Jaakkola et al.(1994)]. RL algorithms typically assume a stationary environment. Quasi-stationary environments can be handled by continually adapting the agent. However, changes in the environment may be too fast for such adaptation, defeating methods that assume a stationary environment.
To provide a principled framework for RL in fast-changing environments, we introduce a model called the ε-MDP, a generalization of the ε-stationary MDP [Kalmár et al.(1998)]. In this novel MDP concept the environment is allowed to change over time, provided that the accumulated changes remain bounded. In particular, the transition probabilities may vary as a function of time. The only requirement is that the change remains small: it is bounded by a small number ε. We cannot expect to find an optimal policy; in this setting one may not even exist. Nevertheless, we prove the following important result using the generalized MDP framework of [Szepesvári and Littman(1996)]: if a reinforcement learning algorithm that approximates the optimal value function converges to the optimal value function in the MDP model, then in the corresponding ε-MDP the asymptotic distance between the optimal value function and its approximation is bounded, and the bound is proportional to ε under the same conditions. The main result of our paper is thus as follows: if an RL algorithm can learn an optimal policy in an MDP, then it is capable of learning a near-optimal policy in an ε-MDP as well.
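Stated schematically (the precise conditions and the form of the constant are given with the theorem in Section 3), if the value-function estimates $V_t$ produced by such an algorithm converge to the optimal value function $V^*$ in the underlying MDP, then in the corresponding ε-MDP
\[
  \limsup_{t \to \infty} \bigl\| V_t - V^* \bigr\|_\infty \;\le\; C\,\varepsilon ,
\]
where $C$ is a constant that does not depend on ε. This sketch only illustrates the shape of the statement; the exact theorem and its proof appear in Section 3.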
We illustrate the ε-MDP model in conjunction with a novel reinforcement learning algorithm called event-learning [Lorincz et al.(2002)]. In typical RL formulations, the agent learns an optimal policy that prescribes the optimal action for any given state. This kind of policy has much the same advantages and drawbacks as conditioned reflexes: it can solve difficult tasks, but it may be sensitive to minor changes in the environment. Furthermore, new parameter settings for the same problem may require restarting learning from scratch. In event-learning, the policy selects desired successor states instead of actions. Consequently, the values of state-state pairs (events) are learned rather than state-action values. The task of bringing the agent to the desired successor state is passed to a lower-level controller. It has been shown elsewhere [Szita et al.(2002)] that event-learning with a particular non-Markovian controller, the SDS controller [Szepesvári and Lorincz(1997)], belongs to the ε-MDP problem family under certain conditions. Convergence of learning in ε-MDPs is studied via computer simulations on the two-link pendulum problem. These simulations are intended to demonstrate that ε-MDPs can be applied, e.g., to various models with uncertain and/or noisy state descriptions.
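To convey the flavor of the value updates involved (the notation here is illustrative; the exact algorithm is given in Section 4), suppose the policy at state $x$ selects a desired successor $y^{d}$, the lower-level controller actually brings the agent to state $y$, and reward $r$ is received. A tabular event-value function $E$ can then be updated in the familiar Q-learning fashion, but over state-state pairs:
\[
  E_{t+1}(x, y^{d}) = (1-\alpha_t)\,E_t(x, y^{d}) + \alpha_t \Bigl( r + \gamma \max_{y'} E_t(y, y') \Bigr),
\]
where $\alpha_t$ is a learning rate and $\gamma$ is the discount factor. The event values play the role that state-action values play in ordinary Q-learning, while the problem of actually realizing the selected event is delegated to the controller.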
We will argue that ε-MDPs can be related to RL methods that make use of ideas from control theory [Doya(1996), Doya(2000), ten Hagen(2001)] and to module-based RL [Maes(1992), Mahadevan and Connell(1992), Mataric(1997), Kalmár et al.(1998)].
The article is organized as follows. Section 2 provides an overview of MDPs and generalized MDPs. In Section 3 we introduce the concept of generalized ε-MDPs and the corresponding generalized Q-learning, and prove the main `convergence' theorem for generalized ε-MDPs. Section 4 provides an overview of event-learning. Section 5 contains illustrations of the ε-MDP model within the event-learning framework using the two-link pendulum problem. Conclusions are drawn in Section 6. The paper ends with an appendix providing mathematical details on ε-MDPs, including the proof that the event-learning algorithm augmented with an adapting non-Markovian controller can form an ε-MDP problem.