Near-optimal Regret Bounds for Reinforcement Learning
Thomas Jaksch, Ronald Ortner, Peter Auer; 11(51):1563−1600, 2010.
Abstract
For undiscounted reinforcement learning in Markov decision
processes (MDPs) we consider the total regret of
a learning algorithm with respect to an optimal policy.
In order to describe the transition structure of an MDP we propose a new parameter:
An MDP has diameter D if for any pair of states s,s' there is
a policy which moves from s to s' in at most D steps (on average).
We present a reinforcement learning algorithm with total regret
Õ(DS√AT) after T steps for any unknown MDP
with S states, A actions per state, and diameter D.
A corresponding lower bound of Ω(√DSAT) on the
total regret of any learning algorithm is given as well.
These results are complemented by a sample complexity bound on the
number of suboptimal steps taken by our algorithm. This bound can be
used to achieve a (gap-dependent) regret bound that is logarithmic in T.
Finally, we also consider a setting where the MDP is allowed to change
a fixed number of l times. We present a modification of our algorithm
that is able to deal with this setting and show a regret bound of
Õ(l1/3T2/3DS√A).
[abs]
[pdf][bib]© JMLR 2010. (edit, beta) |