Processing math: 100%

Normal Bandits of Unknown Means and Variances

Wesley Cowan, Junya Honda, Michael N. Katehakis.

Year: 2018, Volume: 18, Issue: 154, Pages: 1−28


Abstract

Consider the problem of sampling sequentially from a finite number of N2 populations, specified by random variables Xik, i=1,,N, and k=1,2,; where Xik denotes the outcome from population i the kth time it is sampled. It is assumed that for each fixed i, {Xik}k1 is a sequence of i.i.d. normal random variables, with unknown mean μi and unknown variance σ2i. The objective is to have a policy π for deciding from which of the N populations to sample from at any time t=1,2, so as to maximize the expected sum of outcomes of n total samples or equivalently to minimize the regret due to lack on information of the parameters μi and σ2i. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from \cite{bkmab96}. Additionally, finite horizon regret bounds are given.

PDF BibTeX