As shown by Szepesvári and Littman, the analogue of the Q-learning algorithm (Section 2.1) can be defined for generalized MDPs as well [Szepesvári and Littman(1996)]. Moreover, convergence results can be established for this general algorithm.
In this subsection we review the generalized algorithm. We restrict ourselves to models where the operator $\oplus$ is the expected value operator, i.e., $(\oplus g)(x,a) = \sum_y P(x,a,y)\, g(x,a,y)$ for any function $g$ of the state, action and next state. This simplifies the definition considerably, and for the purposes of this article this special case is sufficient. The general definition can be found in the work of [Szepesvári and Littman(1996)].
In Q-learning, for the optimal state-action value function $Q^*$, $Q^* = \oplus(R + \gamma \otimes Q^*)$ holds. Note that $Q^*$ is the fixed point of operator $K$, which is defined by
$$ (KQ)(x,a) = \sum_y P(x,a,y)\bigl[ R(x,a,y) + \gamma\,(\otimes Q)(y) \bigr]. $$
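In the special case when $\otimes$ is maximization over the actions, the choice that recovers ordinary Q-learning, this fixed-point equation is just the familiar Bellman optimality equation for the Q-function:
$$ Q^*(x,a) = \sum_y P(x,a,y)\bigl[ R(x,a,y) + \gamma \max_{b} Q^*(y,b) \bigr]. $$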
The generalized Q-learning algorithm starts with an arbitrary initial value function $Q_0$, and then uses the following update rule:
$$ Q_{t+1}(x,a) = (1 - \alpha_t(x,a))\, Q_t(x,a) + \alpha_t(x,a)\bigl[ R(x,a,y) + \gamma\,(\otimes Q_t)(y) \bigr], $$
where $y$ is the next state observed after taking action $a$ in state $x$, and $\alpha_t(x,a)$ denotes the learning rate associated with the pair $(x,a)$ at time $t$.
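As an illustration (not taken from the cited work), the update rule can be sketched for a tabular representation in Python; the names below (`generalized_q_update`, `agg`) are purely illustrative, and the $\otimes$ operator is passed in as an ordinary function so that `np.max` recovers ordinary Q-learning.

```python
import numpy as np

def generalized_q_update(Q, x, a, r, y, alpha, gamma, agg):
    """One application of the generalized Q-learning update rule (tabular sketch).

    Q      -- 2-D array of state-action values, shape (n_states, n_actions)
    x, a   -- current state and action indices
    r      -- observed reward R(x, a, y)
    y      -- observed next-state index
    alpha  -- learning rate alpha_t(x, a)
    gamma  -- discount factor
    agg    -- the "otimes" operator: maps the Q-values of a state to a scalar,
              e.g. np.max recovers ordinary Q-learning
    """
    target = r + gamma * agg(Q[y])                      # r + gamma * (otimes Q_t)(y)
    Q[x, a] = (1.0 - alpha) * Q[x, a] + alpha * target  # convex combination with old value
    return Q

# Example usage: with agg = np.max this is exactly the ordinary Q-learning update.
Q = np.zeros((5, 3))
Q = generalized_q_update(Q, x=0, a=1, r=1.0, y=2, alpha=0.1, gamma=0.9, agg=np.max)
```

Passing $\otimes$ in as a function mirrors the generalized formulation: only the aggregation step changes between the different members of this family of algorithms.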
It has been proven that under appropriate conditions on the order of updates and on the learning parameters $\alpha_t(x,a)$, the above Q-learning algorithm converges to the optimal Q-value function. The proof is based on a theorem that establishes the convergence of a specific asynchronous stochastic approximation process. Both theorems are cited in Appendix A.
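Conditions of this kind are familiar from ordinary Q-learning: typically every state-action pair must be updated infinitely often and the learning rates must satisfy
$$ \sum_{t=0}^{\infty} \alpha_t(x,a) = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2(x,a) < \infty $$
for every pair $(x,a)$; the precise hypotheses used here are those of the theorems cited in Appendix A.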