On the Convergence of Optimistic Policy Iteration
John N. Tsitsiklis
3(Jul):59-72, 2002.
Abstract
We consider a finite-state Markov decision problem and
establish the convergence of a special case of
optimistic policy iteration that involves Monte Carlo estimation
of
Q-values, in conjunction with greedy policy selection.
We provide convergence results for a number of algorithmic variations,
including one that
involves temporal difference learning (bootstrapping) instead of Monte Carlo
estimation. We also indicate some extensions that either fail or are unlikely
to carry over.
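To make the algorithmic setting concrete, the following is a minimal sketch (not from the paper) of optimistic policy iteration with Monte Carlo estimation of Q-values and greedy policy selection, run on a hypothetical two-state toy MDP. The transition table, step sizes, and horizon are illustrative assumptions, not the paper's construction; "optimistic" here means each Q-value is updated incrementally from a single truncated rollout rather than evaluated to completion before the greedy improvement step.

```python
import random

# Hypothetical toy MDP (illustrative only, not from the paper):
# P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
STATES, ACTIONS = [0, 1], [0, 1]
GAMMA = 0.9

def step(s, a, rng):
    """Sample a transition (next_state, reward) from the MDP."""
    r, acc = rng.random(), 0.0
    for p, s2, rew in P[s][a]:
        acc += p
        if r <= acc:
            return s2, rew
    return s2, rew

def mc_return(s, a, policy, rng, horizon=50):
    """Truncated Monte Carlo estimate of Q^pi(s, a) from one rollout."""
    s2, rew = step(s, a, rng)
    g, disc = rew, GAMMA
    for _ in range(horizon):
        s2, rew = step(s2, policy[s2], rng)
        g += disc * rew
        disc *= GAMMA
    return g

def optimistic_policy_iteration(iters=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    policy = {s: 0 for s in STATES}
    for _ in range(iters):
        s = rng.choice(STATES)            # exploring starts over (s, a)
        a = rng.choice(ACTIONS)
        g = mc_return(s, a, policy, rng)
        Q[s][a] += alpha * (g - Q[s][a])  # incremental Monte Carlo update
        # Greedy policy selection interleaved with incomplete evaluation.
        policy[s] = max(ACTIONS, key=lambda b: Q[s][b])
    return policy, Q

policy, Q = optimistic_policy_iteration()
print(policy)  # action 1 yields positive reward in both states here
```

In this toy instance action 1 dominates in both states, so the greedy policy settles on it; the paper's contribution is establishing when such interleaved estimation and improvement provably converges.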