Least-Squares Policy Iteration
Michail G. Lagoudakis, Ronald Parr; 4(Dec):1107-1149, 2003.
Abstract
We propose a new approach to reinforcement learning for control
problems that combines value-function approximation with linear
architectures and approximate policy iteration. This new approach is
motivated by the least-squares temporal-difference learning algorithm
(LSTD) for prediction problems, which is known for its efficient use
of sample experiences compared to pure temporal-difference
algorithms. Heretofore, LSTD has not had a straightforward application
to control problems, mainly because LSTD learns the state value
function of a fixed policy, which cannot be used for action selection
and control without a model of the underlying process. Our new
algorithm, least-squares policy iteration (LSPI), learns the
state-action value function, which allows for action selection without
a model and for incremental policy improvement within a
policy-iteration framework. LSPI is a model-free, off-policy method
that can efficiently use (and reuse in each iteration) sample
experiences collected in any manner. By separating the
sample collection method, the choice of the linear approximation
architecture, and the solution method, LSPI allows for focused
attention on the distinct elements that contribute to practical
reinforcement learning. LSPI is tested on the simple task of
balancing an inverted pendulum and the harder task of balancing and
riding a bicycle to a target location. In both cases, LSPI learns to
control the pendulum or the bicycle by merely observing a relatively
small number of trials in which actions are selected randomly. LSPI is
also compared against Q-learning (both with and without experience
replay) using the same value-function architecture. While LSPI achieves
good performance fairly consistently on the difficult bicycle task, the
Q-learning variants were rarely able to balance for more than a small
fraction of the time needed to reach the target location.
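
To make the policy-evaluation and policy-improvement structure described in
the abstract concrete, the following is a minimal Python sketch of an
LSPI-style loop. It assumes a batch of samples given as (s, a, r, s', done)
tuples, a user-supplied basis function phi(s, a) returning a length-k feature
vector, and a finite action set; the function and variable names (lstdq,
lspi, phi, gamma, and the small ridge term added for numerical stability) are
illustrative assumptions, not the authors' implementation.

import numpy as np

def lstdq(samples, phi, k, gamma, w, actions):
    """One policy-evaluation step: estimate the Q-function of the policy
    that is greedy with respect to the current weight vector w."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next, done in samples:
        phi_sa = phi(s, a)
        if done:
            # Absorbing state: no successor features.
            phi_next = np.zeros(k)
        else:
            # Greedy action of the current policy at the successor state.
            a_next = max(actions, key=lambda ap: phi(s_next, ap) @ w)
            phi_next = phi(s_next, a_next)
        A += np.outer(phi_sa, phi_sa - gamma * phi_next)
        b += phi_sa * r
    # Small ridge term (an assumption) keeps the solve well-conditioned.
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)

def lspi(samples, phi, k, gamma, actions, n_iters=20, tol=1e-4):
    """Approximate policy iteration: repeatedly re-solve LSTDQ on the same
    sample set until the weights (and hence the greedy policy) stabilize."""
    w = np.zeros(k)
    for _ in range(n_iters):
        w_new = lstdq(samples, phi, k, gamma, w, actions)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

Note that the same sample set is reused at every iteration, which is the
sense in which the method is off-policy and sample-efficient: only the
greedy action used to build the successor features changes between
iterations.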