## Normal Bandits of Unknown Means and Variances

*Wesley Cowan, Junya Honda, Michael N. Katehakis*; 18(154):1−28, 2018.

### Abstract

Consider the problem of sampling sequentially from a finite
number of $N \geq 2$ populations, specified by random variables
$X^i_k$, $ i = 1,\ldots , N,$ and $k = 1, 2, \ldots$; where
$X^i_k$ denotes the outcome from population $i$ the $k^{th}$
time it is sampled. It is assumed that for each fixed $i$, $\{
X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. normal random
variables, with unknown mean $\mu_i$ and unknown variance
$\sigma_i^2$. The objective is to have a policy $\pi$ for
deciding from which of the $N$ populations to sample from at any
time $t=1,2,\ldots$ so as to maximize the expected sum of
outcomes of $n$ total samples or equivalently to minimize the
regret due to lack on information of the parameters $\mu_i$ and
$\sigma_i^2$. In this paper, we present a simple inflated sample
mean (ISM) index policy that is asymptotically optimal in the
sense of Theorem 4 below. This resolves a standing open problem
from \cite{bkmab96}. Additionally, finite horizon regret bounds
are given.

[abs][pdf][bib]