Asymptotic Boundedness of Value Iteration
In this section we prove a generalized form of Szepesvári and Littman's convergence theorem (for the original theorem, see Appendix A). We do not require probability-one uniform convergence of the approximating operators, only a sufficiently close approximation. Therefore the theorem can be applied to prove results about algorithms in generalized $\varepsilon$-MDPs.
Our definition of closeness, both for value functions and for dynamic-programming operators, is given below. Let $X$ be an arbitrary state space and denote by $\mathcal{V}$ the set of value functions over $X$. Let $T : \mathcal{V} \to \mathcal{V}$ be an arbitrary contraction mapping with unique fixed point $V^*$, and let $T_t : \mathcal{V} \times \mathcal{V} \to \mathcal{V}$ be a sequence of random operators.
Definition 3.1
A sequence of value functions $V_t$ $\kappa$-approximates $V$ with $\kappa > 0$, if $\limsup_{t\rightarrow\infty} \left\Vert V_t - V \right\Vert \le \kappa$ with probability one.
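As an illustration of Definition 3.1 (not part of the original text), the following Python sketch constructs a sequence that $\kappa$-approximates a target value function without ever converging to it; the names and constants are assumptions made for the demo.

```python
import numpy as np

kappa = 0.1
V = np.array([1.0, 2.0, 3.0])      # target value function on a 3-state space

# A sequence that kappa-approximates V but does not converge to it:
# it oscillates around V with an amplitude shrinking to kappa, not to 0.
def V_t(t):
    amplitude = kappa * (1.0 + 1.0 / t)   # tends to kappa from above
    return V + ((-1.0) ** t) * amplitude

dists = [np.max(np.abs(V_t(t) - V)) for t in range(1, 10_000)]

# Empirical stand-in for the limsup: the largest distance over a late tail.
print(max(dists[-100:]) <= kappa + 1e-3)   # prints True: limsup <= kappa
print(min(dists) > 0.0)                    # prints True: V_t never reaches V
```

Uniform convergence would force these distances to vanish; here they settle at $\kappa$, which is exactly the weaker guarantee the definition captures.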
Definition 3.2
We say that $T_t$ $\kappa$-approximates $T$ at $V$ over $X$, if for any $V_0$ and for $V_{t+1} = T_t(V_t, V)$, the sequence $V_t$ $\kappa$-approximates $T V$ over $X$ with probability one.
Note that $T_t$ may depend on the approximated value function $V$, unlike the previous example in Equation 4.
$\kappa$-approximation of value functions is, indeed, weaker (more general) than probability-one uniform convergence: the latter means that for all $\epsilon > 0$ there exists a $t_0$ such that for all $t \ge t_0$, $\Vert V_t - V \Vert \le \epsilon$, whereas an equivalent form of $\kappa$-approximation is that for all $\epsilon > 0$ there exists a $t_0$ such that for all $t \ge t_0$, $\Vert V_t - V \Vert \le \kappa + \epsilon$, and $\kappa$ is fixed.
Theorem 3.3
Let $T$ be an arbitrary mapping with fixed point $V^*$, and let $T_t$ $\kappa$-approximate $T$ at $V^*$ over $X$. Let $V_0$ be an arbitrary value function, and define $V_{t+1} = T_t(V_t, V_t)$. If there exist functions $0 \leq F_t(x) \leq 1$ and $0 \leq G_t(x) \leq 1$ satisfying the conditions below with probability one:
- for all $V_1$, $V_2$ and all $x \in X$, $\vert T_t(V_1, V^*)(x) - T_t(V_2, V^*)(x) \vert \le G_t(x) \left\Vert V_1 - V_2 \right\Vert$;
- for all $V$, $V'$ and all $x \in X$, $\vert T_t(V, V^*)(x) - T_t(V, V')(x) \vert \le F_t(x) \left\Vert V^* - V' \right\Vert$;
- for all $k > 0$, $\prod_{t=k}^{n} G_t(x)$ converges to zero uniformly in $x$ as $n$ increases; and,
- there exists $0 \le \gamma < 1$ such that for all $x \in X$ and sufficiently large $t$, $F_t(x) \le \gamma \, (1 - G_t(x))$,

then $V_t$ $\kappa'$-approximates $V^*$ over $X$, where $\kappa' = \frac{2}{1-\gamma} \kappa$.
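As a numerical sanity check of the theorem's conclusion (an illustrative sketch, not part of the original development), the Python fragment below takes a linear policy-evaluation operator as the $\gamma$-contraction $T$, defines a randomized $T_t$ whose backups carry persistent noise bounded by $\kappa$, and verifies empirically that the asymptotic error of $V_{t+1} = T_t(V_t, V_t)$ stays within $\kappa' = \frac{2}{1-\gamma}\kappa$. The operator shapes and constants are assumptions made for the demo; here $G_t(x) = 1-\alpha$ and $F_t(x) = \alpha\gamma$ satisfy all four conditions, the last one with equality.

```python
import numpy as np

rng = np.random.default_rng(0)

n, gamma, alpha, kappa = 5, 0.9, 0.5, 0.05
r = rng.uniform(0.0, 1.0, size=n)          # rewards (illustrative)
P = rng.uniform(size=(n, n))
P /= P.sum(axis=1, keepdims=True)          # row-stochastic transition matrix

def T(V):
    # Exact policy-evaluation operator: a gamma-contraction in the sup norm.
    return r + gamma * P @ V

V_star = np.linalg.solve(np.eye(n) - gamma * P, r)   # unique fixed point of T

def T_t(V, V_prime):
    # Randomized operator: averages the old estimate with a noisy backup.
    # The noise never vanishes, but it is always bounded by kappa.
    noise = rng.uniform(-kappa, kappa, size=n)
    return (1.0 - alpha) * V + alpha * (T(V_prime) + noise)

V = np.zeros(n)
errs = []
for t in range(2000):
    V = T_t(V, V)                          # the iteration V_{t+1} = T_t(V_t, V_t)
    errs.append(np.max(np.abs(V - V_star)))

tail = max(errs[-500:])                    # empirical stand-in for the limsup
bound = 2.0 * kappa / (1.0 - gamma)        # the theorem's kappa'
print(tail <= bound)                       # prints True
```

Because the noise is averaged rather than eliminated, the iterate never converges to $V^*$, yet its asymptotic error remains inside the theorem's band.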
Proof.
The proof is similar to that of the original theorem. First we define the sequence of auxiliary functions $U_t$ by the recursion $U_0 = V_0$, $U_{t+1} = T_t(U_t, V^*)$. Since $T_t$ $\kappa$-approximates $T$ at $V^*$, $\limsup_{t\to\infty} \Vert U_t - V^* \Vert \le \kappa$, i.e., for sufficiently large $t_0$ and $t \ge t_0$, $\Vert U_t - V^* \Vert \le \kappa$. Let $\delta_t(x) := \vert V_t(x) - U_t(x) \vert$. For $t \ge t_0$ we have

$$
\begin{aligned}
\delta_{t+1}(x) &= \vert T_t(V_t, V_t)(x) - T_t(U_t, V^*)(x) \vert \\
&\le \vert T_t(V_t, V_t)(x) - T_t(V_t, V^*)(x) \vert + \vert T_t(V_t, V^*)(x) - T_t(U_t, V^*)(x) \vert \\
&\le F_t(x) \left\Vert V^* - V_t \right\Vert + G_t(x) \left\Vert V_t - U_t \right\Vert \\
&\le F_t(x) \bigl( \left\Vert V^* - U_t \right\Vert + \left\Vert U_t - V_t \right\Vert \bigr) + G_t(x) \left\Vert \delta_t \right\Vert \\
&\le F_t(x) \left( \kappa + \left\Vert \delta_t \right\Vert \right) + G_t(x) \left\Vert \delta_t \right\Vert .
\end{aligned}
$$

Then, by Lemma C.1 (found in the appendix), we get that $\limsup_{t\to\infty} \Vert \delta_t \Vert \le \frac{1+\gamma}{1-\gamma} \kappa$. From this, $\limsup_{t\to\infty} \Vert V_t - V^* \Vert \le \limsup_{t\to\infty} \Vert V_t - U_t \Vert + \limsup_{t\to\infty} \Vert U_t - V^* \Vert \le \frac{1+\gamma}{1-\gamma} \kappa + \kappa = \frac{2}{1-\gamma} \kappa$.
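The contraction argument at the heart of the proof can also be checked numerically. The following Python sketch (illustrative constants, with condition 4 taken to hold with equality) iterates the worst-case scalar form of the inequality above, $\delta_{t+1} = G\,\delta_t + F\,(\delta_t + \kappa)$, and confirms that it settles below the $\frac{1+\gamma}{1-\gamma}\,\kappa$ bound supplied by Lemma C.1.

```python
gamma, G, kappa = 0.9, 0.5, 0.1
F = gamma * (1.0 - G)        # condition 4 satisfied with equality
delta = 5.0                  # arbitrary initial discrepancy ||V_0 - U_0||

for _ in range(10_000):
    # Worst-case scalar form of the inequality derived in the proof.
    delta = G * delta + F * (delta + kappa)

bound = (1.0 + gamma) / (1.0 - gamma) * kappa
print(delta <= bound)        # prints True
```

Since $F \le \gamma(1-G)$ implies $G + F \le G + \gamma(1-G) < 1$ whenever $\gamma < 1$ and $G < 1$, the recursion is itself a contraction, which is the property Lemma C.1 exploits.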