Improving Policy Gradient Estimates with Influence Information
J. Pinto, A. Fern, T. Bauer & M. Erwig; JMLR W&CP 20:1–18, 2011.
Keywords: side information, policy gradient RL, adaptation-based programming
In reinforcement learning (RL) it is often possible to obtain sound, but incomplete,
information about influences and independencies among problem variables and rewards, even when an
exact domain model is unknown. For example, such information can be computed from a partial,
qualitative domain model, or via domain-specific analysis techniques. While such information
intuitively seems useful for RL, no existing algorithms incorporate it in a sound way.
In this work, we describe how to leverage such information to improve the estimation of
policy gradients, which can speed up gradient-based RL. We prove general conditions
under which our estimator is unbiased and show that it will typically have reduced variance
compared to standard unbiased gradient estimates.
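As a rough illustration of the idea (a minimal sketch, not the estimator derived in the paper), the following Python code computes a REINFORCE-style gradient for one trajectory in which each decision's score term is weighted only by the rewards that decision can possibly influence; the names episode, influences, and policy_grad_logp are hypothetical.

```python
import numpy as np

def influence_pg_estimate(episode, influences, policy_grad_logp):
    """One-trajectory policy gradient estimate using influence sets.

    episode          : list of (state, action, reward) tuples
    influences       : influences[t] = set of time steps whose rewards
                       decision t can possibly influence (sound, but
                       possibly incomplete, independence information)
    policy_grad_logp : (state, action) -> gradient of log pi(action|state)
    """
    rewards = np.array([r for (_, _, r) in episode])
    grad = 0.0
    for t, (s, a, _) in enumerate(episode):
        # Standard REINFORCE would weight the score by the full return
        # from time t; here provably irrelevant rewards are dropped,
        # removing their contribution to the variance without adding bias.
        influenced = [k for k in influences[t] if k >= t]
        grad = grad + policy_grad_logp(s, a) * rewards[influenced].sum()
    return grad
```

Note that if every influence set contains all future time steps, this reduces exactly to the standard unbiased estimator, so the filtered version only differs when some independence is actually known.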
We evaluate the approach in the domain of
Adaptation-Based Programming, where RL is used to optimize the performance of programs
and independence information can be computed via standard program analysis techniques.
Incorporating independence information produces a large speedup in learning on a variety of adaptive
programs.
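For a sense of how program analysis could supply such information, here is a minimal sketch under assumed structure, not the paper's implementation: if the adaptive program's data and control dependencies are given as a graph, a choice point can only influence the rewards reachable from it, so simple graph reachability yields sound (if incomplete) influence sets.

```python
from collections import deque

def influence_sets(dep_edges, choice_nodes, reward_nodes):
    """Map each choice point to the reward nodes it can reach.

    dep_edges : dict node -> iterable of nodes it can affect
                (data/control dependence edges, assumed given)
    Rewards not reachable from a choice are provably independent
    of it, which is exactly the side information the estimator needs.
    """
    result = {}
    for c in choice_nodes:
        seen, frontier = {c}, deque([c])
        while frontier:
            node = frontier.popleft()
            for succ in dep_edges.get(node, ()):
                if succ not in seen:
                    seen.add(succ)
                    frontier.append(succ)
        result[c] = {r for r in reward_nodes if r in seen}
    return result

# Toy usage: choice c1 feeds only reward r1, so r2 is independent of c1.
deps = {"c1": ["x"], "x": ["r1"], "c2": ["r2"]}
print(influence_sets(deps, ["c1", "c2"], ["r1", "r2"]))
# {'c1': {'r1'}, 'c2': {'r2'}}
```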