Asymptotic Boundedness of Value Iteration
In this section we prove a generalized form of Szepesvári and Littman's convergence
theorem (for the original theorem, see Appendix A). We do not require probability-one
uniform convergence of the approximating operators, only a sufficiently close
approximation. Therefore the theorem can be applied to prove results about algorithms
in generalized $\epsilon$-MDPs.
Our definition of closeness, both for value functions and for dynamic-programming
operators, is given below.
Let $\mathcal{X}$ be an arbitrary state space and denote by
$\mathcal{V} = \{ V : \mathcal{X} \to \mathbb{R} \}$ the set of value functions. Let
$T : \mathcal{V} \to \mathcal{V}$ be an arbitrary contraction mapping with unique
fixed point $v^*$, and let $T_t : \mathcal{V} \times \mathcal{V} \to \mathcal{V}$
be a sequence of random operators.
Definition 3.1
A sequence of value functions $v_t$ $\epsilon$-approximates $v \in \mathcal{V}$ with
$\epsilon > 0$ over a set $X \subseteq \mathcal{X}$, if
$$\limsup_{t \to \infty} \, \sup_{x \in X} \bigl| v_t(x) - v(x) \bigr| \le \epsilon$$
with probability one.
Definition 3.2
We say that the operator sequence $T_t$ $\epsilon$-approximates $T$ at
$v \in \mathcal{V}$ over $X \subseteq \mathcal{X}$, if for any
$V_0 \in \mathcal{V}$ and for the sequence defined by $V_{t+1} = T_t(V_t, v)$,
$V_t$ $\epsilon$-approximates $T v$ over $X$ with probability one.
Note that $\epsilon$ may depend on the approximated value function $v$,
unlike in the previous example in Equation 4.
$\epsilon$-approximation of value functions is, indeed, weaker (more
general) than probability-one uniform convergence: the latter
means that for all $\epsilon' > 0$ there exists a $t_0$ such that
$\sup_{x \in X} | v_t(x) - v(x) | \le \epsilon'$ for all $t \ge t_0$,
whereas an equivalent form of $\epsilon$-approximation is that for
all $\epsilon' > \epsilon$ there exists a $t_0$ such that
$\sup_{x \in X} | v_t(x) - v(x) | \le \epsilon'$ for all $t \ge t_0$,
where $\epsilon$ is fixed.
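To make the distinction concrete, here is a small numerical sketch (purely illustrative; the target values, the tolerance, and the perturbation are assumptions, not taken from the text). It builds a sequence of value functions that never converges to its target, yet $\epsilon$-approximates it in the sense of Definition 3.1, because the limit superior of the sup-distance equals $\epsilon$.

```python
import numpy as np

# Illustrative only: a 3-state target value function and a sequence v_t that
# epsilon-approximates it (Definition 3.1) without converging to it.
v_star = np.array([1.0, 2.0, 3.0])
epsilon = 0.1

def v_t(t):
    transient = 5.0 / (t + 1)               # vanishes as t grows
    persistent = epsilon * np.sin(0.5 * t)  # bounded, never vanishes
    return v_star + transient + persistent

# Empirical check of the limsup criterion: sup-distance over a long tail.
tail = [np.max(np.abs(v_t(t) - v_star)) for t in range(1000, 2000)]
print(f"max sup-distance over the tail: {max(tail):.4f}  (epsilon = {epsilon})")
# The tail maximum stays close to epsilon and tends to epsilon as the tail
# moves out, so limsup_t ||v_t - v_star|| = epsilon: v_t epsilon-approximates
# v_star, although it does not converge (the sine term never dies out).
```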
Theorem 3.3
Let $T$ be an arbitrary mapping with fixed point $v^*$, and let $T_t$
$\epsilon$-approximate $T$ at $v^*$ over $X \subseteq \mathcal{X}$. Let $V_0$ be an
arbitrary value function, and define $V_{t+1} = T_t(V_t, V_t)$. If
there exist functions $0 \le F_t(x) \le 1$ and $0 \le G_t(x) \le 1$
satisfying the conditions below with probability one,
- for all $U_1, U_2 \in \mathcal{V}$ and all $x \in \mathcal{X}$,
$\bigl| T_t(U_1, v^*)(x) - T_t(U_2, v^*)(x) \bigr| \le G_t(x) \, \bigl| U_1(x) - U_2(x) \bigr|$;
- for all $U, V \in \mathcal{V}$ and all $x \in \mathcal{X}$,
$\bigl| T_t(U, v^*)(x) - T_t(U, V)(x) \bigr| \le F_t(x) \, \| v^* - V \|$;
- for all $k > 0$,
$\prod_{t=k}^{n} G_t(x)$ converges to zero uniformly in $x$ as $n$ increases; and,
- there exists $0 \le \gamma < 1$
such that for all $x$ and sufficiently large $t$, $F_t(x) \le \gamma \bigl( 1 - G_t(x) \bigr)$,

then $V_t$ $\kappa$-approximates $v^*$ over $X$, where
$\kappa = \frac{2\epsilon}{1-\gamma}$.
Proof.
The proof is similar to that of the original theorem. First we
define the sequence of auxiliary value functions $U_t$ by the recursion
$U_0 = V_0$, $U_{t+1} = T_t(U_t, v^*)$. Since $T_t$
$\epsilon$-approximates $T$ at $v^*$, the sequence $U_t$
$\epsilon$-approximates $T v^* = v^*$, i.e., for sufficiently large $t$
and $x \in X$, $| U_t(x) - v^*(x) | \le \epsilon'$, where $\epsilon' > \epsilon$
is arbitrary (using the equivalent form above). Let
$$\Delta_t(x) = \bigl| V_t(x) - U_t(x) \bigr|.$$
For $\Delta_{t+1}(x)$ we have
$$\begin{aligned}
\Delta_{t+1}(x) &= \bigl| T_t(V_t, V_t)(x) - T_t(U_t, v^*)(x) \bigr| \\
&\le \bigl| T_t(V_t, V_t)(x) - T_t(V_t, v^*)(x) \bigr|
   + \bigl| T_t(V_t, v^*)(x) - T_t(U_t, v^*)(x) \bigr| \\
&\le F_t(x) \, \| v^* - V_t \| + G_t(x) \, \Delta_t(x) \\
&\le G_t(x) \, \Delta_t(x) + F_t(x) \bigl( \| \Delta_t \| + \epsilon' \bigr)
\end{aligned}$$
for sufficiently large $t$. Then, by Lemma
C.1 (found in the appendix), we get
that $\limsup_{t \to \infty} \| \Delta_t \| \le \frac{\epsilon'}{1-\gamma}$. From this,
$$\limsup_{t \to \infty} \| V_t - v^* \|
\le \limsup_{t \to \infty} \| \Delta_t \| + \limsup_{t \to \infty} \| U_t - v^* \|
\le \frac{\epsilon'}{1-\gamma} + \epsilon'
\le \frac{2 \epsilon'}{1-\gamma},$$
and since $\epsilon' > \epsilon$ was arbitrary, $V_t$ $\kappa$-approximates $v^*$
over $X$ with $\kappa = \frac{2\epsilon}{1-\gamma}$.
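As an informal sanity check of the theorem (not part of the original text), the following sketch runs value iteration on a small, hypothetical 3-state, 2-action MDP whose Bellman operator is corrupted at every step by a bounded disturbance of size at most $\epsilon$; such an operator sequence $\epsilon$-approximates the exact operator at $v^*$. The MDP, the disturbance model, and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
n_states, n_actions, gamma, eps = 3, 2, 0.9, 0.05
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, y]
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[x, a]

def bellman(V):
    # Exact Bellman optimality operator T:
    # (T V)(x) = max_a [ R(x, a) + gamma * sum_y P(x, a, y) V(y) ]
    return np.max(R + gamma * (P @ V), axis=1)

# Fixed point v* of T, obtained by iterating T to numerical convergence.
v_star = np.zeros(n_states)
for _ in range(2000):
    v_star = bellman(v_star)

# Perturbed value iteration: each step applies T plus a bounded disturbance
# of size at most eps, i.e., an epsilon-approximation of T at v*.
V = np.zeros(n_states)
errors = []
for t in range(5000):
    noise = rng.uniform(-eps, eps, size=n_states)
    V = bellman(V) + noise
    errors.append(np.max(np.abs(V - v_star)))

kappa = 2 * eps / (1 - gamma)            # the bound kappa = 2*eps/(1 - gamma)
tail_error = max(errors[-1000:])         # empirical asymptotic error
print(f"asymptotic error ~ {tail_error:.4f},  bound kappa = {kappa:.4f}")
# The tail error stays below kappa: the iterates remain in a neighborhood of
# v_star instead of converging to it exactly.
```

For this simple disturbance model the empirical error settles near $\epsilon/(1-\gamma)$, well inside the guaranteed neighborhood of radius $\kappa$.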