It is an easy consequence of Theorem A.1 that, under appropriate conditions, the Q-learning algorithm converges.
Consider the substitution
,
and
. Furthermore, let the
randomized approximate dynamic programming operator sequence
be defined by
![]()
Finally, let
Condition (4) of Theorem A.1 trivially holds, while
conditions (1) and (2) are implied by the definition of and
the non-expansion property of
. Finally, condition (3) is a
consequence of assumption (3) and the definition of
.
Applying Theorem A.1 proves the statement.
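
As an illustration only, and not as part of the proof, the following minimal Python sketch runs the standard tabular Q-learning update on a small example. Everything specific in it is an assumption made for the illustration: the two-state, two-action deterministic MDP, the discount factor 0.5, the uniform exploration policy, the step-size schedule (visit count to the power -0.75, a Robbins-Monro type schedule of the kind usually assumed for Q-learning convergence), and all function and variable names. For this hypothetical MDP the optimal action values can be computed by hand: taking action 1 is worth 2 and taking action 0 is worth 1 in either state.

```python
import random


def q_learning(n_steps=20000, gamma=0.5, seed=0):
    """Tabular Q-learning on a hypothetical 2-state, 2-action MDP.

    Assumed dynamics (illustrative only): taking action a moves the
    process to state a, and yields reward 1 if a == 1, else 0.
    """
    rng = random.Random(seed)
    states, actions = (0, 1), (0, 1)
    q = {(s, a): 0.0 for s in states for a in actions}
    visits = {key: 0 for key in q}

    s = 0
    for _ in range(n_steps):
        a = rng.choice(actions)            # uniform exploration: every pair is visited infinitely often
        reward = 1.0 if a == 1 else 0.0    # assumed reward function
        s_next = a                         # assumed deterministic transition

        # Step size for this state-action pair: sum diverges, sum of squares converges.
        visits[(s, a)] += 1
        alpha = visits[(s, a)] ** -0.75

        # Q-learning update: move Q(s, a) toward the sampled one-step target.
        target = reward + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

        s = s_next
    return q


if __name__ == "__main__":
    for key, value in sorted(q_learning().items()):
        print(key, round(value, 3))
```

With these assumed settings the learned table should approach the hand-computed optimal values; only the updated state-action pair changes at each step, which is the asynchronous, sample-based behaviour that the operator sequence in the proof is meant to capture.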