In this section we propose an extension of the MDP concept in which the transition probabilities are allowed to vary over time. Without restrictions, however, such a model would be too general to establish useful theorems, so we restrict ourselves to the case where the variation over time remains small.
We say that the distance of two transition functions $P$ and $P'$ is $\varepsilon$-small ($d(P, P') \le \varepsilon$) if $\|P(x,a,\cdot) - P'(x,a,\cdot)\|_{L_1} \le \varepsilon$ for all $x \in X$ and $a \in A$, i.e., $\sum_{y \in X} |P(x,a,y) - P'(x,a,y)| \le \varepsilon$ for all $x$ and $a$, where the subscript $L_1$ denotes the $L_1$ norm. (Note that for a given state $x$ and action $a$, $P(x,a,\cdot)$ is a probability distribution over $X$.)
A tuple $(X, A, \{P_t\}_{t=1,2,\ldots}, R)$ is an $\varepsilon$-stationary MDP ($\varepsilon$-MDP for short) [Kalmár et al. (1998)] with $\varepsilon > 0$, if there exists an MDP $(X, A, P, R)$ (called the base MDP) such that the difference of the transition functions $P$ and $P_t$ is $\varepsilon$-small for all $t$.
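To make these definitions concrete, the following sketch (our own illustration, not part of the original formulation) checks the $\varepsilon$-small condition for finite state and action spaces; the array layout and function names are assumptions.

```python
import numpy as np

def transition_distance(P, P_prime):
    """d(P, P') = sup_{x,a} ||P(x,a,.) - P'(x,a,.)||_{L1}, for transition
    functions stored as arrays of shape (|X|, |A|, |X|) whose last axis
    sums to one for every state-action pair."""
    return np.abs(P - P_prime).sum(axis=-1).max()

def is_epsilon_mdp(P_base, P_sequence, eps):
    """True if every P_t in the sequence is eps-small-close to the base
    transition function, i.e. the sequence defines an eps-MDP w.r.t. P_base."""
    return all(transition_distance(P_base, P_t) <= eps for P_t in P_sequence)
```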
Possibly the simplest example of an $\varepsilon$-MDP is an ordinary MDP with additive noise superimposed on its transition function, i.e., $P_t = P + \Delta P_t$ in each step, where $\Delta P_t$ is a small amount of noise satisfying $d(P, P + \Delta P_t) \le \varepsilon$. A more general case will be discussed within the event-learning framework.
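One way to realise this additive-noise construction is sketched below (our own illustration; the mixture with a random transition function is an assumption chosen only so that every row remains a probability distribution).

```python
import numpy as np

def noisy_transitions(P, eps, rng):
    """Return P_t = P + Delta_t with d(P, P_t) <= eps.  The perturbation is
    realised by mixing P with a random transition function U; the mixing
    weight eps/2 guarantees ||P_t(x,a,.) - P(x,a,.)||_{L1} <= eps."""
    U = rng.random(P.shape)
    U /= U.sum(axis=-1, keepdims=True)   # random transition function
    return (1.0 - eps / 2.0) * P + (eps / 2.0) * U
```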
In the ordinary MDP model, the dynamic programming operator $T$ is used to find the optimal value function $V^*$. In an $\varepsilon$-MDP, the operator in effect at time step $t$ is given by
$$(T_t V)(x) = \max_{a \in A} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_t(x,a,y)\, V(y) \Bigr],$$
where $0 \le \gamma < 1$ is the discount factor, and the corresponding value-iteration sequence is $V_{t+1} = T_t V_t$.
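For finite $X$ and $A$, and with the reward written as $R(x,a)$, the time-$t$ operator can be sketched as follows (an illustrative implementation under the assumed array layout, not the paper's code):

```python
import numpy as np

def bellman_operator(P_t, R, V, gamma):
    """(T_t V)(x) = max_a [ R(x,a) + gamma * sum_y P_t(x,a,y) V(y) ].
    P_t: (|X|, |A|, |X|) transitions in effect at time t,
    R:   (|X|, |A|) immediate rewards, V: (|X|,) value estimate."""
    return (R + gamma * (P_t @ V)).max(axis=1)

# One step of the eps-MDP value iteration: V_{t+1} = T_t V_t, e.g.
# V_next = bellman_operator(P_t, R, V, gamma=0.9)
```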
Of course, the iteration $V_{t+1} = T_t V_t$ does not necessarily have a fixed point. The most that we can expect to find is a close approximation of the optimal value function $V^*$ of the base MDP, i.e., a value function $V$ such that $\|V - V^*\| \le \kappa(\varepsilon)$ with some $\kappa(\varepsilon) > 0$.¹
In Section 3.2 we show that such an approximation can be found (Theorem 3.3).
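The toy simulation below (our own construction with arbitrary sizes, seed, and noise level, not an experiment from the paper) illustrates the statement: the iterates $V_{t+1} = T_t V_t$ do not converge, but they remain in a small neighbourhood of the base MDP's optimal value function $V^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, eps = 5, 3, 0.9, 0.05

# Random base MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman(P_xa, V):
    return (R + gamma * (P_xa @ V)).max(axis=1)

# Optimal value function of the base MDP by ordinary value iteration.
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = bellman(P, V_star)

# eps-MDP iteration V_{t+1} = T_t V_t with a fresh eps-small P_t each step.
V = np.zeros(n_states)
drift = []
for _ in range(2000):
    U = rng.random((n_states, n_actions, n_states))
    U /= U.sum(axis=-1, keepdims=True)
    P_t = (1.0 - eps / 2.0) * P + (eps / 2.0) * U   # d(P, P_t) <= eps
    V = bellman(P_t, V)
    drift.append(np.abs(V - V_star).max())

print("max ||V_t - V*|| over the last 100 steps:", max(drift[-100:]))
```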