To begin with, we recall the definition of a Markov Decision Process (MDP) [Puterman(1994)]. A (finite) MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the finite sets of states and actions, respectively. $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is called the transition function, since $P(s, a, s')$ gives the probability of arriving at state $s'$ after executing action $a$ in state $s$. Finally, $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function: $R(s, a, s')$ gives the immediate reward for the transition $(s, a, s')$.
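To make the definition concrete, the following is a minimal sketch of a finite MDP represented with plain Python dictionaries; the two-state example, the state and action names, and the `step` helper are illustrative assumptions, not part of the original text.

```python
import random

# Illustrative two-state finite MDP (S, A, P, R); all names are hypothetical.
states = ["s0", "s1"]          # finite state set S
actions = ["stay", "move"]     # finite action set A

# Transition function P: P[(s, a)] maps each successor s' to P(s, a, s').
# For every (s, a) pair, the probabilities over s' sum to 1.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.1, "s1": 0.9},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# Reward function R: R[(s, a, s')] is the immediate reward for the
# transition (s, a, s'); unlisted transitions yield reward 0.
R = {
    ("s0", "move", "s1"): 1.0,
}

def step(s, a, rng):
    """Sample s' ~ P(s, a, .) and return (s', R(s, a, s'))."""
    u = rng.random()
    cumulative = 0.0
    for s_next, prob in P[(s, a)].items():
        cumulative += prob
        if u <= cumulative:
            return s_next, R.get((s, a, s_next), 0.0)
    # Guard against floating-point slack: fall back to the last successor.
    return s_next, R.get((s, a, s_next), 0.0)

s_next, reward = step("s0", "move", random.Random(0))
print(s_next, reward)  # e.g. s1 1.0
```

The dictionary-of-dictionaries layout mirrors the tuple definition directly: looking up `P[(s, a)]` gives the distribution over successor states, which is all that is needed to simulate trajectories through the MDP.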