The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
Shie Mannor, John N. Tsitsiklis; 5(Jun):623--648, 2004.
Abstract
We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given
n arms, a total of
O((
n/ε
2)log(1/δ)) trials suffices
in order to find an
ε-optimal arm with probability at least 1-δ. We establish a matching lower bound
on the expected number of trials under any sampling policy.
We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms.
We also provide a similar bound within a Bayesian setting.
The case where the statistics of the arms are known but the identities of the arms
are not, is also discussed.
For this case, we provide a lower bound
of Θ((1/ε
2)(
n+log(1/δ))) on the expected number of trials,
as well as a sampling policy with a matching
upper bound.
If instead of the expected number of trials, we consider the maximum (over all sample paths)
number of trials, we establish a matching upper and lower bound of the form
Θ((
n/ε
2)log(1/δ)).
Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.
[abs][pdf][ps.gz][ps]