The Convergence of the Generalized Q-learning Algorithm

It is an easy consequence of Theorem A.1 that under appropriate conditions, the Q-learning algorithm is convergent.

Theorem A.3 If Assumption A holds, and the learning rates satisfy $\sum_{t=0}^{\infty} \chi(x_t\!=\!x, a_t\!=\!a) \alpha_t(x,a) = \infty$ and $\sum_{t=0}^{\infty} \chi(x_t\!=\!x, a_t\!=\!a) \alpha_t(x,a)^2 < \infty$ uniformly w.p.1, then the generalized Q-learning algorithm described by iteration (Equation 3) converges to

w.p.1 uniformly over $X \times A$ .

Proof. The theorem is an easy consequence of Theorem A.1 with the appropriate substitutions and assignments:

Consider the substitution $X \leftarrow X \times A$ , $V^* \leftarrow Q^*$ and $V_t \leftarrow Q_t$ . Furthermore, let the randomized approximate dynamic programming operator sequence defined by

$\displaystyle T_t(Q',Q)(x,a) = \begin{cases}(1-\alpha_t(x,a))Q'(x,a) + \alpha_t... ...), & \text{if $x=x_t$\ and $a=a_t$}, \\ Q'(x,a) & \text{otherwise}, \end{cases}$

which approximates

w.p.1 over $X \times A$ under the assumptions (2)-(4) by the well-known Robbins-Monro theorem [Robbins and Monro(1951)].

Finally, let

$\displaystyle G_t(x,a)= \begin{cases}1-\alpha_t(x,a), & \text{if $x=x_t$\ and $a=a_t$}, \\ 1, & \text{otherwise} \end{cases}$

(11)

and

$\displaystyle F_t(x,a)= \begin{cases}\gamma\alpha_t(x,a), & \text{if $x=x_t$\ and $a=a_t$}, \\ 0, & \text{otherwise.} \end{cases}$

(12)

Condition (4) of Theorem A.1 trivially holds, while conditions (1) and (2) are implied by the definition of and the non-expansion property of ${\textstyle\bigotimes}$ . Finally, condition (3) is a consequence of assumption (3) and the definition of .

Applying Theorem A.1 proves the statement. $\qedsymbol$