Convergence to the neighborhood of the optimal value function

Convergence properties of the event-value function for the two-link pendulum are shown in Figure 5. The experiment uses a crude discretization of the state space. The parameters of the pendulum are not changed. However, the crude discretization, combined with a robust controller that is itself part of the environment, manifests itself as a varying environment.

The theorems of Section 3 concern the supremum norm. Figure 5A shows two supremum-norm curves, one with the SDS controller turned off ($\Lambda=0$) and one with it turned on ($\Lambda=1.5$). Convergence occurs for learning with the SDS controller `on'.¹⁰ Interestingly, convergence is faster with the SDS controller than without it. This is a consequence of the larger variety of actions available when the robust controller is applied.

The square norm, computed against the last event-value function of this series of experiments (Figure 5B), may provide insight into the performance of the two-link pendulum. That performance can be characterized by the average task duration and the standard deviation of task duration during the course of learning (Figure 5C). The $\Lambda=1.5$ case has a clear advantage over learning without the robust controller.
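The two distance measures used in Figures 5A and 5B can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `distances_to_final`, the array layout (one row per recorded time step, one column per event), and the reading of "square norm" as a plain sum of squared differences are all assumptions.

```python
import numpy as np

def distances_to_final(snapshots):
    """Distance of each recorded event-value function from the last one.

    snapshots: array of shape (T, n_events), one row per recorded time step
    (hypothetical layout). Returns two arrays of length T.
    """
    snapshots = np.asarray(snapshots, dtype=float)
    # Reference is the event-value function of the last time step.
    diff = snapshots - snapshots[-1]
    # Supremum norm: largest absolute difference over all events.
    sup_norm = np.abs(diff).max(axis=1)
    # "Square norm" taken here as the sum of squared differences (assumption).
    square_norm = (diff ** 2).sum(axis=1)
    return sup_norm, square_norm

# Toy data: three snapshots of a three-event value function.
snapshots = np.array([[0.0, 0.0, 0.0],
                      [0.5, 1.0, 0.0],
                      [1.0, 2.0, 0.5]])
sup_norm, square_norm = distances_to_final(snapshots)
```

Because the reference is the last snapshot itself, both measures are zero at the final point by construction, which is why that point carries no information about convergence.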

**Figure 5:** **Convergence of Value Iteration with and without a Robust Controller**
The horizontal axis indicates `pendulum time' in seconds. Circles/dotted line: without robust controller ( $\Lambda =0.0$ ); squares/continuous line: with robust controller ( $\Lambda =1.5$ ).
A: Convergence of event-value functions in supremum norm. The supremum norm was computed against the event values of the last time step; it is therefore zero at the last point (not shown).
B: Convergence of event value functions in square norm. Square norm was computed by summing about terms. Square norm was computed against event values belonging to the last time step. Square norm values beyond time $9 \times 10^5$ are underestimated.
C: Average task duration and standard deviation of task duration during the course of learning. Note the different scales of the left-hand and right-hand sub-figures.
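Statistics of the kind plotted in panel C could be obtained along the following lines. This is a sketch under stated assumptions: the source does not say how trials were grouped, so the non-overlapping windows, the function name `windowed_duration_stats`, and the `window` parameter are hypothetical.

```python
import numpy as np

def windowed_duration_stats(durations, window):
    """Mean and standard deviation of task duration over consecutive,
    non-overlapping groups of `window` trials (grouping is an assumption)."""
    durations = np.asarray(durations, dtype=float)
    n_windows = len(durations) // window
    # Drop trailing trials that do not fill a complete window.
    blocks = durations[: n_windows * window].reshape(n_windows, window)
    # np.std defaults to the population standard deviation (ddof=0).
    return blocks.mean(axis=1), blocks.std(axis=1)

# Toy durations in seconds, shortening as learning proceeds.
durations = [10.0, 14.0, 6.0, 6.0, 4.0, 4.0]
mean_dur, std_dur = windowed_duration_stats(durations, window=2)
```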