Proof.
Define the auxiliary operator sequence

$$(\tilde K_t Q)(x,a) = (1-\gamma_t(x,a))\,Q(x,a) + \gamma_t(x,a)\bigl(g_t(x,a) + \alpha \max_b Q(\bar y_{t+1}, b)\bigr), \qquad (6)$$

where $\bar y_{t+1}$ is sampled from distribution $P(x_t, a_t, \cdot)$, and $\gamma_t(x,a) = 0$ whenever $(x,a) \neq (x_t, a_t)$. Note that the only difference between $K_t$ and $\tilde K_t$ is that the successor states ($y_{t+1}$ and $\bar y_{t+1}$) are sampled from different distributions ($P_t(x_t,a_t,\cdot)$ and $P(x_t,a_t,\cdot)$, respectively). By Lemma B.1, we can assume that $\Pr(y_{t+1} \neq \bar y_{t+1}) \le \varepsilon$, and at the same time $y_{t+1}$ is sampled from $P_t(x_t,a_t,\cdot)$ and $\bar y_{t+1}$ from $P(x_t,a_t,\cdot)$ ($y_{t+1}$ and $\bar y_{t+1}$ are not independent of each other).
Note also that the definition in Equation 6 differs from that of Equation 5, because $\bar y_{t+1}$ is not necessarily the successor state of $x_t$. Fortunately, the conditions of the Robbins-Monro theorem [Szepesvári(1998)] do not require this fact, so it remains true that $\tilde K_t$ approximates $K$ at $Q^*$ uniformly w.p.1.
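Lemma B.1 asserts that such a pair of dependent samples exists; a standard way to realize it is a maximal coupling, which draws $(y, \bar y)$ with the prescribed marginals so that the two coordinates disagree exactly with the total-variation distance of the two distributions. Below is a minimal sketch for finite distributions; the function names are illustrative, not from the paper.

```python
import random

def maximal_coupling(p, q):
    """Sample (x, y) with x ~ p, y ~ q, and Pr(x != y) equal to TV(p, q).

    p, q: dicts mapping outcomes to probabilities (a maximal-coupling sketch).
    """
    support = set(p) | set(q)
    overlap = {z: min(p.get(z, 0.0), q.get(z, 0.0)) for z in support}
    m = sum(overlap.values())  # m = 1 - TV(p, q)
    if random.random() < m:
        # With probability m, draw one common value from the overlap.
        z = weighted_choice({z: w / m for z, w in overlap.items()})
        return z, z
    # Otherwise draw each coordinate from its normalized residual;
    # the residuals have disjoint supports, so the coordinates differ.
    rp = {z: (p.get(z, 0.0) - overlap[z]) / (1 - m) for z in support}
    rq = {z: (q.get(z, 0.0) - overlap[z]) / (1 - m) for z in support}
    return weighted_choice(rp), weighted_choice(rq)

def weighted_choice(w):
    """Draw a key of w with probability proportional to its weight."""
    r, acc = random.random(), 0.0
    for z, pz in w.items():
        acc += pz
        if r < acc:
            return z
    return z  # guard against floating-point round-off
```

Sampling many pairs from two distributions at total-variation distance $0.2$ yields a disagreement frequency close to $0.2$, while each coordinate keeps its own marginal, which is exactly what the proof needs from $y_{t+1}$ and $\bar y_{t+1}$.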
Let $Q$ be an arbitrary value function, and let $Q'_{t+1} = K_t Q'_t$ and $\tilde Q_{t+1} = \tilde K_t \tilde Q_t$, with $Q'_0 = \tilde Q_0 = Q$. Recall that $\kappa$-approximation means that $\limsup_{t\to\infty} \|Q'_t - K Q^*\| \le \kappa$ w.p.1.
Since $\|Q'_t - K Q^*\| \le \|Q'_t - \tilde Q_t\| + \|\tilde Q_t - K Q^*\|$ and $\|\tilde Q_t - K Q^*\| \to 0$ w.p.1, it suffices to show that $\limsup_{t\to\infty} \|Q'_t - \tilde Q_t\| \le \kappa$ w.p.1.
Clearly, for any $(x,a)$,

$$|Q'_{t+1}(x,a) - \tilde Q_{t+1}(x,a)| \le (1-\gamma_t)\,|Q'_t(x,a) - \tilde Q_t(x,a)| + \gamma_t\,\alpha\,\bigl|\max_b Q'_t(y_{t+1}, b) - \max_b \tilde Q_t(\bar y_{t+1}, b)\bigr|,$$

where we used the shorthand $\gamma_t$ for $\gamma_t(x,a)$.
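The second term of this bound is controlled by the fact that the max operator is nonexpansive in the supremum norm: for any two rows of action values evaluated at the same state, $|\max_b Q_1(b) - \max_b Q_2(b)| \le \max_b |Q_1(b) - Q_2(b)|$. A quick randomized check of this elementary fact (the tables below are hypothetical, not the paper's notation):

```python
import random

random.seed(1)

def max_gap(q1, q2):
    """Return |max q1 - max q2| and the componentwise sup distance."""
    return abs(max(q1) - max(q2)), max(abs(a - b) for a, b in zip(q1, q2))

# The max operator never expands sup-norm distances between value tables.
ok = True
for _ in range(1000):
    q1 = [random.uniform(-1.0, 1.0) for _ in range(4)]  # Q1(y, .)
    q2 = [random.uniform(-1.0, 1.0) for _ in range(4)]  # Q2(y, .)
    lhs, rhs = max_gap(q1, q2)
    ok = ok and lhs <= rhs + 1e-12
```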
Since this holds for every $(x,a)$, it also does for their maximum, $\|Q'_{t+1} - \tilde Q_{t+1}\|$. As mentioned before, $\Pr(y_{t+1} \neq \bar y_{t+1}) \le \varepsilon$. If $y_{t+1} = \bar y_{t+1}$, then $|\max_b Q'_t(y_{t+1}, b) - \max_b \tilde Q_t(\bar y_{t+1}, b)| \le \|Q'_t - \tilde Q_t\|$. Otherwise the only applicable upper bound is $2 Q_{\max}$. Therefore the following random sequence bounds $\|Q'_t - \tilde Q_t\|$ from above: $\Delta_0 = 0$,

$$\Delta_{t+1} = (1-\gamma_t)\,\Delta_t + \gamma_t\,\alpha\,(\Delta_t + \zeta_t), \qquad (7)$$

where $\zeta_t = 2 Q_{\max}$ with probability $\varepsilon$ and $\zeta_t = 0$ otherwise.
It is easy to see that Equation 7 is a Robbins-Monro-like iterated averaging [Robbins and Monro(1951), Jaakkola et al.(1994)]. Therefore $\Delta_t$ converges to the fixed point of $\Delta = \alpha(\Delta + \mathbf{E}\,\zeta_t)$, that is, to $2\alpha\varepsilon Q_{\max}/(1-\alpha)$, w.p.1.
Consequently, $\limsup_{t\to\infty} \|Q'_t - \tilde Q_t\| \le 2\alpha\varepsilon Q_{\max}/(1-\alpha) = \kappa$, which completes the proof.
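As a numerical sanity check of the iterated-averaging step, the following simulation runs a recursion of the form $\Delta_{t+1} = (1-\gamma_t)\Delta_t + \gamma_t\alpha(\Delta_t + \zeta_t)$, with $\zeta_t = 2Q_{\max}$ with probability $\varepsilon$ and $0$ otherwise, under Robbins-Monro step sizes. The constants are illustrative only and the symbol names are this sketch's own; for these values the averaged fixed point $2\alpha\varepsilon Q_{\max}/(1-\alpha)$ equals $0.9$.

```python
import random

random.seed(0)
alpha, eps, q_max = 0.9, 0.05, 1.0              # illustrative constants only
target = 2 * alpha * eps * q_max / (1 - alpha)  # fixed point of Delta = alpha*(Delta + E[zeta])

delta = 0.0
for t in range(1, 200001):
    gamma = t ** -0.6   # Robbins-Monro steps: sum diverges, sum of squares converges
    zeta = 2 * q_max if random.random() < eps else 0.0
    delta = (1 - gamma) * delta + gamma * alpha * (delta + zeta)
```

After 200,000 iterations the iterate settles near the predicted limit, illustrating why the bounding sequence pins $\|Q'_t - \tilde Q_t\|$ to an $O(\varepsilon)$ neighborhood.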
Theorem 3.5
Let $Q^*$ be the optimal value function of the base MDP of the generalized $\varepsilon$-MDP, and let $Q_t$ be the sequence generated by the generalized Q-learning rule. If Assumption A holds, then $\limsup_{t\to\infty} \|Q_t - Q^*\| \le \kappa$ w.p.1, i.e., the sequence $Q_t$ $\kappa$-approximates the optimal value function $Q^*$ with $\kappa = 2\alpha\varepsilon Q_{\max}/(1-\alpha)^2$.
Proof.
By Lemma 3.4, the operator sequence $K_t$ $\kappa'$-approximates $K$ (defined in Equation 2) at $Q^*$ with $\kappa' = 2\alpha\varepsilon Q_{\max}/(1-\alpha)$. The proof can be finished analogously to the proof of Corollary A.3: with the appropriate substitution, and defining the auxiliary sequences as in Equations 11 and 12, the conditions of Theorem 3.3 hold, thus proving our statement.