R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
Ronen I. Brafman, Moshe Tennenholtz;
3(Oct):213-231, 2002.
Abstract
R-MAX is a very simple model-based reinforcement learning
algorithm that can attain near-optimal average reward in polynomial
time. In R-MAX, the agent always maintains a complete, but possibly
inaccurate model of its environment and acts based on the optimal
policy derived from this model. The model is initialized in an
optimistic fashion: all actions in all states return the maximal
possible reward (hence the name). During execution, it is updated
based on the agent's observations. R-MAX improves upon several
previous algorithms: (1) It is simpler and more general than Kearns
and Singh's E^3 algorithm, covering zero-sum stochastic
games. (2) It has a built-in mechanism for resolving
the exploration vs. exploitation dilemma. (3) It formally
justifies the "optimism under uncertainty" bias used in many RL
algorithms.
(4) It is simpler, more general, and more efficient than Brafman
and Tennenholtz's LSG algorithm for learning in single controller
stochastic games. (5) It generalizes the algorithm by Monderer and
Tennenholtz for learning in repeated games.
(6) To date, it is the only provably efficient algorithm for learning in
repeated games, and it considerably improves upon and simplifies previous
algorithms by Banos and by Megiddo.
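
Below is a minimal sketch of the single-agent R-MAX scheme described in the abstract, assuming a finite MDP with known state and action sets and a known reward bound R_max. The visit threshold m, the discount factor gamma (used here for planning in place of the paper's undiscounted T-step average-reward analysis), and the RMaxAgent / observe interface are illustrative assumptions, not the paper's exact formulation.

import numpy as np

class RMaxAgent:
    """Sketch of R-MAX: plan optimistically in a learned model of a finite MDP."""

    def __init__(self, n_states, n_actions, r_max, m=10, gamma=0.95):
        # One extra fictitious absorbing state represents "unknown" outcomes.
        self.nS, self.nA = n_states + 1, n_actions
        self.fict = n_states
        self.r_max, self.m, self.gamma = float(r_max), m, gamma
        # Empirical statistics for each (state, action) pair.
        self.counts = np.zeros((self.nS, self.nA))
        self.trans_counts = np.zeros((self.nS, self.nA, self.nS))
        self.reward_sums = np.zeros((self.nS, self.nA))
        self.policy = np.zeros(self.nS, dtype=int)
        self._plan()  # initial model is fully optimistic

    def _model(self):
        # Known (s, a) pairs use empirical estimates; unknown pairs keep the
        # optimistic initialization: reward R_max and a jump to the fictitious state.
        P = np.zeros((self.nS, self.nA, self.nS))
        R = np.full((self.nS, self.nA), self.r_max)
        for s in range(self.nS):
            for a in range(self.nA):
                if self.counts[s, a] >= self.m:
                    P[s, a] = self.trans_counts[s, a] / self.counts[s, a]
                    R[s, a] = self.reward_sums[s, a] / self.counts[s, a]
                else:
                    P[s, a, self.fict] = 1.0
        return P, R

    def _plan(self, iters=500):
        # Value iteration on the current (optimistic) model.
        P, R = self._model()
        V = np.zeros(self.nS)
        for _ in range(iters):
            V = (R + self.gamma * (P @ V)).max(axis=1)
        self.policy = (R + self.gamma * (P @ V)).argmax(axis=1)

    def act(self, s):
        return self.policy[s]

    def observe(self, s, a, r, s_next):
        # Record experience only until (s, a) becomes known, then replan once.
        if self.counts[s, a] < self.m:
            self.counts[s, a] += 1
            self.trans_counts[s, a, s_next] += 1
            self.reward_sums[s, a] += r
            if self.counts[s, a] == self.m:
                self._plan()

Because unknown state-action pairs look maximally rewarding in the model, the greedy policy is implicitly drawn to visit them; this is the built-in mechanism for balancing exploration and exploitation that item (2) of the abstract refers to.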