Adaptive Step-size Policy Gradients with Average Reward Metric
Takamitsu Matsubara (NAIST/ATR), Tetsuro Morimura (IBM Research),
and Jun Morimoto (ATR);
JMLR W&P 13:285-298, 2010.
Abstract
In this paper, we propose a novel adaptive step-size approach for policy
gradient reinforcement learning. A new metric is defined for policy
gradients that measures the effect of changes in the policy parameters
on the average reward. Since the metric directly measures
the effects on the average reward, the resulting policy gradient learning
employs an adaptive step-size strategy that can effectively avoid
falling into stagnant phases caused by the complex structure of the
average reward function with respect to the policy parameters. Two
algorithms are derived with the metric as variants of ordinary and
natural policy gradients. Their properties are compared with previously
proposed policy gradients through numerical experiments with
simple, but non-trivial, 3-state Markov Decision Processes (MDPs).
We also show performance improvements over previous methods in
on-line learning with more challenging 20-state MDPs.