Lemma D.1 (Corollary 4.3)
Assume that the environment is such that
for all
. Let ε be a prescribed number. For sufficiently large
and sufficiently small time steps, the SDS controller
described in Equation 10 and the environment form an ε-MDP.
Proof.
From [Szepesvári et al. (1997)] it is known that for
sufficiently fine time steps, the eventual tracking error is
bounded by
, i.e., for sufficiently large
,
For sufficiently large
,
. Therefore for arbitrary value function
we may write
This means that the system is indeed an ε-MDP.