Similarly to most other RL algorithms, the E-learning algorithm [Lorincz et al.(2002)] also uses a value function, the event-value function $E: X \times X \to \mathbb{R}$. Pairs of states $(x, y)$ and $(x, y^d)$ are called events and desired events, respectively. For a given initial state $x_t$, let us denote the desired next state by $y^d_{t+1}$. The state pair $(x_t, y^d_{t+1})$ is the desired event, or subtask; $E(x_t, y^d_{t+1})$ is the value of trying to get from actual state $x_t$ to next state $y^d_{t+1}$. State $y_{t+1}$ reached by the subtask could be different from the desired state $y^d_{t+1}$. One of the advantages of this formulation is that one may--but does not have to--specify the transition time: realizing the subtask may take more than one step for the controller, which is working in the background.
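As an illustration of the bookkeeping involved, the following minimal sketch stores a tabular event-value function indexed by (state, desired-state) pairs; the class and method names are hypothetical and assume a small finite state space.

```python
import numpy as np

# Minimal sketch (assumed names): a tabular event-value function E(x, y_d)
# over a finite state space. E[x, y_d] is the value of trying to reach the
# desired next state y_d from the current state x.
class EventValueTable:
    def __init__(self, n_states: int, init_value: float = 0.0):
        self.E = np.full((n_states, n_states), init_value)

    def value(self, x: int, y_desired: int) -> float:
        # Value of the desired event (subtask) x -> y_desired.
        return float(self.E[x, y_desired])

    def best_desired_state(self, x: int) -> int:
        # Greedy event selection: the desired successor with the highest value.
        return int(np.argmax(self.E[x]))
```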
This value may be different from the expected discounted total reward of eventually getting from $x$ to $y^d$. We use the former definition, since we want to use the event-value function for finding an optimal successor state. To this end, the event-selection policy $\pi^E: X \times X \to [0, 1]$ is introduced. $\pi^E(x, y^d)$ gives the probability of selecting desired state $y^d$ in state $x$. However, the system usually cannot be controlled by ``wishes'' (desired new states); decisions have to be expressed in actions. This is done by the action-selection policy (or controller policy) $\pi^A: X \times X \times A \to [0, 1]$, where $\pi^A(x, y^d, a)$ gives the probability that the agent selects action $a$ to realize the transition $x \to y^d$.
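To make the two-level decision concrete, here is a minimal sketch, assuming tabular policies stored as probability arrays (`pi_E[x, y_d]` over desired states, `pi_A[x, y_d, a]` over actions); all names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_desired_state(pi_E: np.ndarray, x: int) -> int:
    # Event-selection policy: pi_E[x, y_d] is the probability of
    # choosing y_d as the desired next state when in state x.
    return int(rng.choice(pi_E.shape[1], p=pi_E[x]))

def select_action(pi_A: np.ndarray, x: int, y_desired: int) -> int:
    # Controller (action-selection) policy: pi_A[x, y_d, a] is the
    # probability of taking action a to realize the transition x -> y_d.
    return int(rng.choice(pi_A.shape[-1], p=pi_A[x, y_desired]))
```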
An important property of event learning is the following: only the event-selection policy is learned (through the event-value function); the learning problem of the controller's policy is separated from E-learning. From the viewpoint of E-learning, the controller policy is part of the environment, just like the transition probabilities.
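One way to picture this separation is the hedged sketch below: from E-learning's point of view, the controller together with the underlying environment forms an "event-level" environment that accepts a desired state and returns the state actually reached. The `env.step(a) -> (next_state, reward)` and `controller(x, y_desired) -> action` interfaces are assumptions made for illustration only.

```python
def event_step(env, controller, x, y_desired, max_steps=10):
    """Run the controller in the background until it (hopefully) reaches y_desired.

    Returns the state actually reached and the accumulated reward; the reached
    state may differ from the desired one, and the subtask may take several
    primitive steps.
    """
    total_reward, y = 0.0, x
    for _ in range(max_steps):
        a = controller(y, y_desired)       # controller tries to realize the event
        y, r = env.step(a)                 # environment transition
        total_reward += r
        if y == y_desired:                 # subtask accomplished
            break
    return y, total_reward
```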
The event-value function corresponding to a given action-selection policy $\pi^A$ can be expressed by the state-value function:
\[
E^{\pi}(x, y^d) = \sum_a \pi^A(x, y^d, a) \sum_y P(x, a, y) \left[ R(x, a, y) + \gamma V^{\pi}(y) \right].
\]
It can be shown that $E^{\pi}$ satisfies the following Bellman-type equations:
\[
E^{\pi}(x, y^d) = \sum_a \pi^A(x, y^d, a) \sum_y P(x, a, y) \left[ R(x, a, y) + \gamma \sum_{z^d} \pi^E(y, z^d)\, E^{\pi}(y, z^d) \right],
\]
where $\pi$ denotes the state-to-action policy induced by the pair $(\pi^E, \pi^A)$, i.e., $\pi(x, a) = \sum_{y^d} \pi^E(x, y^d)\, \pi^A(x, y^d, a)$, and $V^{\pi}(y) = \sum_{z^d} \pi^E(y, z^d)\, E^{\pi}(y, z^d)$.
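The Bellman-type equation can be evaluated directly when a tabular model is available. The sketch below performs one synchronous backup of the event-value function under that assumption; the array names (`P`, `R`, `pi_A`, `pi_E`) and the known-model setting are assumptions for illustration, not part of the E-learning algorithm itself, which is model-free.

```python
import numpy as np

def bellman_backup_E(E, P, R, pi_A, pi_E, gamma=0.95):
    """One synchronous Bellman-type backup for the event-value function.

    Assumed tabular inputs (illustrative names):
      E[x, y_d]        current event values
      P[x, a, y]       transition probabilities
      R[x, a, y]       immediate rewards
      pi_A[x, y_d, a]  controller (action-selection) policy
      pi_E[x, y_d]     event-selection policy
    """
    n_states = E.shape[0]
    # State values implied by the current event values: V(y) = sum_z pi_E(y, z) E(y, z)
    V = np.einsum('yz,yz->y', pi_E, E)
    E_new = np.zeros_like(E)
    for x in range(n_states):
        for y_d in range(n_states):
            # Expected return of trying to realize the event (x -> y_d).
            backup = 0.0
            for a in range(pi_A.shape[-1]):
                backup += pi_A[x, y_d, a] * np.sum(P[x, a] * (R[x, a] + gamma * V))
            E_new[x, y_d] = backup
    return E_new
```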
An optimal event-value function with respect to an optimal controller policy will be denoted by $E^*$.