The ultimate goal of decision making is to find an optimal behavior subject to some optimality criterion. Optimizing for the infinite-horizon expected discounted total reward is one of the most studied such criteria. Under this criterion, we are trying to find a policy that maximizes the expected value of $\sum_{t=0}^{\infty} \gamma^t r_t$, where $r_t$ is the immediate reward in time step $t$ and $\gamma \in [0,1)$ is the discount factor.
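For concreteness, the short Python sketch below evaluates this criterion for a finite reward sequence (truncating the infinite sum); the function name and inputs are purely illustrative.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Discounted total reward sum_t gamma^t * r_t for a finite reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Rewards 1, 0, 2 with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```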
A standard way to find an optimal policy is to compute the optimal value function $V^*$, which gives the value of each state (the expected cumulative discounted reward from the given starting state). From this, the optimal policy can be easily obtained: the `greedy' policy with respect to the optimal value function is an optimal policy (see e.g. [Sutton and Barto(1998)]). This is the well-known value-function approach [Bellman(1957)].
Formally, the optimal value function satisfies the following recursive system of equations, known as the Bellman equations [Bellman(1957)]:
\begin{equation}
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right].
\end{equation}
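As an illustration of the greedy-policy extraction mentioned above, the sketch below performs one Bellman backup per state-action pair and takes the maximizing action in each state. It assumes a tabular model stored as arrays `P[s, a, s']` and `R[s, a, s']`; the names and array layout are chosen here for illustration only.

```python
import numpy as np

def greedy_policy(V, P, R, gamma=0.9):
    """Greedy policy with respect to a value function V.

    V[s]        -- value estimate of each state
    P[s, a, s'] -- transition probabilities
    R[s, a, s'] -- immediate rewards
    Returns, for every state, the action maximizing the one-step Bellman backup.
    """
    # q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    return q.argmax(axis=1)
```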
Besides the state-value function, other types of value functions can be defined as well. One example is the state-action value function $Q^*$, which satisfies
\begin{equation}
Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]
\end{equation}
in the optimal case. $Q^*(s, a)$ has the meaning ``the expected value of taking action $a$ in state $s$ and then following the optimal policy''. Here the values of state-action pairs are learned instead of state values, which enables model-free learning [Watkins(1989)]. The corresponding learning algorithm is called Q-learning.
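A minimal sketch of the tabular Q-learning update is given below, assuming the Q-function is stored as a NumPy array indexed by state and action and that a single transition $(s, a, r, s')$ has been observed; the parameter names are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (model-free):
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)].
    Q is an (n_states, n_actions) array, alpha is the learning rate."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

Note that the update uses only the observed transition and reward, never the transition model itself; this is what makes the method model-free.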
It is well known that for $0 \le \gamma < 1$, Equation 1 has a unique solution. The Bellman equations can be rewritten using operators: $V^*$ is the (unique) fixed point of the operator $T$ (called the dynamic programming operator), where
\begin{equation}
(T V)(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right].
\end{equation}
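Since $T$ is a contraction for $\gamma < 1$, its fixed point $V^*$ can be approximated by applying $T$ repeatedly from an arbitrary starting point. The sketch below implements this value-iteration scheme under the same illustrative array layout as above.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Approximate the fixed point of the dynamic programming operator T
    by repeated application (value iteration).
    P[s, a, s'] and R[s, a, s'] hold the transition model and rewards."""
    V = np.zeros(P.shape[0])
    while True:
        # Apply T: (TV)(s) = max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        TV = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :]).max(axis=1)
        if np.abs(TV - V).max() < tol:
            return TV
        V = TV
```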