As shown by Szepesvári and Littman, the analogue of the Q-learning algorithm (Section 2.1) can be defined in generalized MDPs as well, and convergence results for this general algorithm can also be established [Szepesvári and Littman(1996)].
In this subsection we review the generalized algorithm. We
restrict ourselves to models where the operator $\otimes$ is the
expected-value operator, i.e.,
\[
  (\otimes V)(x,a) = \sum_{y} P(x,a,y)\, V(y).
\]
This simplifies the definition considerably,
and for the purposes of this article this special case is
sufficient. The general definition can be found in
[Szepesvári and Littman(1996)].
In Q-learning, for the optimal state-action value function $Q^*$,
$V^* = \oplus Q^*$ holds. Note that $Q^*$ is the
fixed point of the operator $T$, which is defined by
\[
  (T Q)(x,a) = \sum_{y} P(x,a,y) \bigl[ r(x,a,y) + \gamma\, (\oplus Q)(y) \bigr].
\]
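To make the fixed-point property concrete, the following sketch iterates $T$ on a small toy MDP; the 2-state, 2-action model, its transition matrix `P`, rewards `r`, and the choice $\oplus = \max_a$ (the ordinary Q-learning case) are hypothetical assumptions for illustration only. Since $T$ is a $\gamma$-contraction, repeated application converges to $Q^*$:

```python
import numpy as np

# Hypothetical toy MDP: P[x, a, y] transition probabilities,
# r[x, a, y] immediate rewards, discount factor gamma.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def oplus(Q):
    # The (+)-operator of the generalized MDP; here the ordinary
    # maximum over actions, which recovers standard Q-learning.
    return Q.max(axis=1)

def T(Q):
    # (T Q)(x, a) = sum_y P(x,a,y) [ r(x,a,y) + gamma * (oplus Q)(y) ]
    return (P * (r + gamma * oplus(Q)[None, None, :])).sum(axis=2)

Q = np.zeros((2, 2))
for _ in range(500):   # value iteration: the error shrinks by gamma per step
    Q = T(Q)
print(np.max(np.abs(Q - T(Q))))  # residual ~ 0: Q is numerically the fixed point
```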
The generalized Q-learning algorithm starts with an arbitrary
initial value function $Q_0$, and then uses the following update
rule:
\[
  Q_{t+1}(x_t,a_t) = \bigl(1 - \alpha_t(x_t,a_t)\bigr)\, Q_t(x_t,a_t)
    + \alpha_t(x_t,a_t) \bigl[ r_t + \gamma\, (\oplus Q_t)(y_t) \bigr],
\]
where $y_t$ is the state reached from $x_t$ by taking action $a_t$,
and $r_t = r(x_t,a_t,y_t)$ is the immediate reward.
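The update rule can be sketched as follows on a simulated toy problem; the 2-state, 2-action MDP, the uniform exploration policy, the polynomial learning-rate schedule, and the choice $\oplus = \max_a$ are all hypothetical assumptions for this example, not part of the general algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: P[x, a, y] is the transition probability,
# r[x, a, y] the immediate reward. Transitions are action-independent
# and the reward equals the chosen action, so action 1 is optimal.
P = np.array([[[0.7, 0.3], [0.7, 0.3]],
              [[0.4, 0.6], [0.4, 0.6]]])
r = np.array([[[0.0, 0.0], [1.0, 1.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
gamma = 0.9

def oplus(Q):
    # The (+)-operator; the maximum over actions recovers ordinary Q-learning.
    return Q.max(axis=1)

Q = np.zeros((2, 2))       # arbitrary initial value function Q_0
visits = np.zeros((2, 2))  # per-pair visit counts for the learning rates

x = 0
for t in range(100_000):
    a = rng.integers(2)               # exploratory (uniform) action choice
    y = rng.choice(2, p=P[x, a])      # sampled next state y_t
    visits[x, a] += 1
    alpha = visits[x, a] ** -0.7      # alpha_t(x,a): a polynomial schedule
    # Q_{t+1}(x,a) = (1 - alpha) Q_t(x,a) + alpha [ r_t + gamma (oplus Q_t)(y) ]
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * (r[x, a, y] + gamma * oplus(Q)[y])
    x = y

print(Q)  # the greedy action argmax_a Q(x, a) is 1 in both states
```

The schedule $\alpha_t = n^{-0.7}$ (with $n$ the visit count of the pair) satisfies the usual Robbins–Monro conditions $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ required by the convergence theorem.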
It has been proven that, under appropriate conditions on the order
of updates and on the learning rates $\alpha_t$, the above
Q-learning algorithm converges to the optimal Q-value function.
The proof is based on a theorem establishing the convergence of a
specific asynchronous stochastic approximation. Both theorems are
cited in Appendix A.