Proof.
Define the auxiliary operator sequence
![$\displaystyle T_t(Q',Q)(x,a) = \begin{cases}(1-\alpha_t(x,a))Q'(x,a) + \alpha_t...
...), & \text{if $x=x_t$\ and $a=a_t$}, \\ Q'(x,a) & \text{otherwise}, \end{cases}$](img153.gif) |
(6) |
where
![$ y_t$](img55.gif)
is sampled from distribution
![$ P(x_t, a_t,
.)$](img57.gif)
, and
![$ x_{t+1}=\tilde{y}_t$](img147.gif)
. Note that the only difference between
![$ T_t$](img84.gif)
and
![$ \tilde{T}_t$](img151.gif)
is that the successor states (
![$ y_t$](img55.gif)
and
![$ \tilde{y}_t$](img143.gif)
) are sampled from different distributions (
![$ P$](img59.gif)
and
![$ P_t$](img69.gif)
). By Lemma
B.1, we can assume that
![$ \Pr(y_t
= \tilde{y}_t) \ge 1-\varepsilon $](img154.gif)
, and at the same time
![$ y_t$](img55.gif)
is
sampled from
![$ P(x_t, a_t,
.)$](img57.gif)
and
![$ \tilde{y}_t$](img143.gif)
from
![$ P_t(x_t, a_t, .)$](img144.gif)
(
![$ y_t$](img55.gif)
and
![$ \tilde{y}_t$](img143.gif)
are not independent from each
other).
Note also that the definition in Equation 6
differs from that of Equation 5, because
is not necessarily the successor state of
.
Fortunately, conditions of the Robbins-Monro theorem
[Szepesvári(1998)] do not require this fact, so it remains
true that
approximates
at
uniformly w.p.1.
Let
be an arbitrary value function, and let
and
Recall that
![$ \kappa$](img94.gif)
-approximation means that
![$ \limsup_{t\to\infty} \Vert\tilde Q_t - Q^*\Vert \le \kappa$](img159.gif)
w.p.1.
Since
![$ \Vert\tilde Q_t - Q^*\Vert \le \Vert\tilde Q_t - Q_t\Vert + \Vert Q_t -
Q^*\Vert$](img160.gif)
and
![$ \Vert Q_t - Q^*\Vert \to 0$](img161.gif)
w.p.1, it suffices to show that
![$ \limsup_{t\to\infty} \Vert\tilde Q_t - Q_t\Vert \le \kappa$](img162.gif)
w.p.1.
Clearly, for any
,
where we used the shorthand
![$ \alpha_t$](img168.gif)
for
![$ \alpha_t(x_t,a_t)$](img169.gif)
.
Since this holds for every
![$ \vert\tilde Q_{t+1}(x,a) - Q_{t+1}(x,a)\vert$](img170.gif)
,
it also does for their maximum,
![$ \Vert\tilde Q_{t+1} - Q_{t+1}\Vert$](img171.gif)
. As
mentioned before,
![$ \Pr(\tilde y_t = y_t)\ge 1-\varepsilon $](img172.gif)
. If
![$ \tilde{y}_t = y_t$](img173.gif)
,
![$ \Vert Q^*(\tilde y_t, .) - Q^*(y_t,.)\Vert = 0$](img174.gif)
.
Otherwise the only applicable upper bound is
![$ M$](img175.gif)
. Therefore the
following random sequence bounds
![$ \Vert\tilde Q_{t} - Q_{t}\Vert$](img176.gif)
from
above:
![$ a_0 := 0$](img177.gif)
,
![$\displaystyle a_{t+1} := (1-\alpha_t) a_t + \alpha_t h_t,$](img178.gif) |
(7) |
where
It is easy to see that Equation
7 is a
Robbins-Monro-like iterated averaging
[
Robbins and Monro(1951),
Jaakkola et al.(1994)]. Therefore
![$ a_t$](img54.gif)
converges
to
![$ E(h_t) = \gamma M \varepsilon $](img180.gif)
w.p.1.
Consequently,
, which completes the proof.
Theorem 3.5
Let
![$ Q^*$](img47.gif)
be the optimal value function of the base MDP of the
generalized
![$ \varepsilon $](img1.gif)
-MDP, and let
![$ M=\max_{x,a} Q^*(x,a) - \min_{x,a} Q^*(x,a)$](img150.gif)
. If Assumption A holds, then
![$ \limsup_{t\to\infty} \Vert Q_t
- Q^*\Vert \le \frac{2}{1-\gamma}\gamma M \varepsilon $](img182.gif)
w.p.1, i.e., the
sequence
![$ \kappa'$](img118.gif)
-approximates the optimal value function
with
![$ \kappa' = \frac{2}{1-\gamma}\gamma M \varepsilon $](img183.gif)
.
Proof.
By Lemma
3.4, the
operator sequence
![$ \kappa$](img94.gif)
-approximates
![$ K$](img49.gif)
(defined in
Equation
2) at
![$ Q^*$](img47.gif)
with
![$ \kappa = \gamma M
\varepsilon $](img152.gif)
. The
proof can be finished analogously to the proof of
Corollary
A.3: with the substitution
![$ X \leftarrow
X \times A$](img184.gif)
, and defining
![$ G_t$](img185.gif)
and
![$ F_t$](img186.gif)
as in Equations
11
and
12, the conditions of Theorem
3.3
hold, thus proving our statement.