Policy Search using Paired Comparisons
Malcolm J. A. Strens, Andrew W. Moore;
3(Dec):921-950, 2002.
Abstract
Direct policy search is a practical way to solve reinforcement
learning (RL) problems involving continuous state and action
spaces. The goal becomes finding policy parameters that maximize a
noisy objective function. The Pegasus method converts this
stochastic optimization problem into a deterministic one, by using
fixed start states and fixed random number sequences for comparing
policies (Ng and Jordan, 2000). We evaluate Pegasus and new paired
comparison methods on the mountain car problem and on a difficult
pursuer-evader problem. We conclude that: (i) paired tests can
improve the performance of optimization procedures; (ii)
several methods are available to reduce the 'overfitting' effect
found with Pegasus; (iii) adapting the number of trials used for
each comparison yields faster learning; (iv) pairing also helps
stochastic search methods such as differential evolution.
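
To make the fixed-seed pairing idea concrete, below is a minimal sketch of a paired comparison in the spirit of Pegasus: two policies are evaluated on the same fixed random number sequences, so episode-to-episode variance cancels in the difference. The rollout dynamics, the linear-tanh policy form, and the parameter values are hypothetical stand-ins for illustration only, not the tasks or policies used in the paper.

import numpy as np

def rollout(policy_params, seed, horizon=200):
    # One episode driven entirely by a fixed seed: the start state and
    # all noise are deterministic given the seed, so evaluating two
    # policies on the same seed exposes them to identical conditions.
    rng = np.random.default_rng(seed)
    state = rng.uniform(-1.0, 1.0, size=2)          # seed-determined start state
    total_reward = 0.0
    for _ in range(horizon):
        action = np.tanh(policy_params @ state)     # toy linear-tanh policy (assumed)
        noise = rng.normal(scale=0.05, size=2)      # seed-determined noise
        state = 0.9 * state + 0.1 * np.array([action, 0.0]) + noise
        total_reward -= np.sum(state ** 2)          # quadratic cost as negative reward
    return total_reward

def paired_comparison(params_a, params_b, seeds):
    # Mean paired difference over a fixed set of seeds; a positive value
    # favours policy A. Pairing removes the variance shared by the two
    # rollouts, which would otherwise swamp small policy differences.
    diffs = [rollout(params_a, s) - rollout(params_b, s) for s in seeds]
    return float(np.mean(diffs))

if __name__ == "__main__":
    a = np.array([0.5, -0.2])
    b = np.array([0.4, 0.1])
    print(paired_comparison(a, b, seeds=range(10)))

Because both rollouts in each difference use the same seed, the comparison is deterministic for a fixed seed set; this is what turns the stochastic optimization into a deterministic one, at the risk of the 'overfitting' effect the abstract mentions.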