To begin with, we recall the definition of a Markov Decision Process (MDP) [Puterman(1994)]. A (finite) MDP is defined by the tuple $\langle S, A, T, R \rangle$, where $S$ and $A$ denote the finite sets of states and actions, respectively. $T : S \times A \times S \to [0, 1]$ is called the transition function, since $T(s, a, s')$ gives the probability of arriving at state $s'$ after executing action $a$ in state $s$. Finally, $R : S \times A \times S \to \mathbb{R}$ is the reward function; $R(s, a, s')$ gives the immediate reward for the transition $(s, a, s')$.
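The definition above can be made concrete with a minimal sketch in Python, assuming a hypothetical two-state, two-action MDP (the states, actions, and numbers are illustrative only, not taken from the text): the transition function is a table mapping each state-action pair to a distribution over successor states, and the reward function assigns a number to each transition.

```python
import random

# Hypothetical toy MDP: two states, two actions (illustration only).
states = ["s0", "s1"]
actions = ["stay", "move"]

# T[(s, a)][s'] = T(s, a, s'): probability of arriving at state s'
# after executing action a in state s.
T = {
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a, s')]: immediate reward for the transition (s, a, s').
# Here, arriving at s1 pays 1, anything else pays 0.
R = {(s, a, s2): (1.0 if s2 == "s1" else 0.0)
     for (s, a), probs in T.items() for s2 in probs}

def step(s, a, rng=random):
    """Sample a successor s' ~ T(s, a, .) and return (s', R(s, a, s'))."""
    successors, probs = zip(*T[(s, a)].items())
    s2 = rng.choices(successors, weights=probs)[0]
    return s2, R[(s, a, s2)]
```

Each row of `T` sums to one, reflecting that $T(s, a, \cdot)$ is a probability distribution over successor states.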