The ultimate goal of decision making is to find an optimal behavior subject to some optimality criterion. Optimizing for the infinite-horizon expected discounted total reward is one of the most studied such criteria. Under this criterion, we are trying to find a policy that maximizes the expected value of $\sum_{t=0}^{\infty} \gamma^t r_t$, where $r_t$ is the immediate reward in time step $t$ and $\gamma \in [0,1)$ is the discount factor.
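For concreteness, the short Python sketch below evaluates this criterion for a finite reward sequence (truncating the infinite sum); the function name and inputs are purely illustrative.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Discounted total reward sum_t gamma^t * r_t for a finite reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Rewards 1, 0, 2 with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```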
A standard way to find an optimal policy is to compute the optimal value function $V^*$, which gives the value of each state (the expected cumulative discounted reward from the given starting state). From this, the optimal policy can be easily obtained: the `greedy' policy with respect to the optimal value function is an optimal policy (see e.g. [Sutton and Barto(1998)]). This is the well-known value-function approach [Bellman(1957)].
Formally, the optimal value function satisfies the following recursive system of equations, known as the Bellman equations [Bellman(1957)]:
\begin{equation}
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right].
\end{equation}
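As an illustration of the greedy-policy extraction mentioned above, the sketch below performs one Bellman backup per state-action pair and takes the maximizing action in each state. It assumes a tabular model stored as arrays `P[s, a, s']` and `R[s, a, s']`; the names and array layout are chosen here for illustration only.

```python
import numpy as np

def greedy_policy(V, P, R, gamma=0.9):
    """Greedy policy with respect to a value function V.

    V[s]        -- value estimate of each state
    P[s, a, s'] -- transition probabilities
    R[s, a, s'] -- immediate rewards
    Returns, for every state, the action maximizing the one-step Bellman backup.
    """
    # q[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
    q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    return q.argmax(axis=1)
```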
Besides the state-value function, other types of value functions can be defined as well. One example is the state-action value function $Q^*$, which satisfies
\begin{equation}
Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]
\end{equation}
in the optimal case. $Q^*(s, a)$ has the meaning ``the expected value of taking action $a$ in state $s$ and then following the optimal policy''. Here the values of state-action pairs are learned instead of state values, which enables model-free learning [Watkins(1989)]. The corresponding learning algorithm is called Q-learning.
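A minimal sketch of the tabular Q-learning update is given below, assuming the Q-function is stored as a NumPy array indexed by state and action and that a single transition $(s, a, r, s')$ has been observed; the parameter names are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (model-free):
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)].
    Q is an (n_states, n_actions) array, alpha is the learning rate."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

Note that the update uses only the observed transition and reward, never the transition model itself; this is what makes the method model-free.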
It is well known that for $0 \le \gamma < 1$, Equation 1 has a unique solution. The Bellman equations can be rewritten using operators: $V^*$ is the (unique) fixed point of the operator $T$ (called the dynamic programming operator), where
\begin{equation}
(T V)(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right].
\end{equation}
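Since $T$ is a contraction for $\gamma < 1$, its fixed point $V^*$ can be approximated by applying $T$ repeatedly from an arbitrary starting point. The sketch below implements this value-iteration scheme under the same illustrative array layout as above.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Approximate the fixed point of the dynamic programming operator T
    by repeated application (value iteration).
    P[s, a, s'] and R[s, a, s'] hold the transition model and rewards."""
    V = np.zeros(P.shape[0])
    while True:
        # Apply T: (TV)(s) = max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        TV = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :]).max(axis=1)
        if np.abs(TV - V).max() < tol:
            return TV
        V = TV
```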