Contextual Bandit Learning with Predictable Rewards
Alekh Agarwal, Miroslav Dudik, Satyen Kale, John Langford, Robert Schapire ; JMLR W&CP 22: 19-26, 2012.
Abstract
Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action and receives a reward based on the action and context. We consider this problem under a realizability assumption: there exists a function in a (known) function class, always capable of predicting the expected reward, given the action and context. Under this assumption, we show three things. We present a new algorithm--Regressor Elimination -- with a regret similar to the agnostic setting (i.e. in the absence of realizability assumption). We prove a new lower bound showing no algorithm can achieve superior performance in the worst case even with the realizability assumption. However, we do show that for \emph{any} set of policies (mapping contexts to actions), there is a distribution over rewards (given context) such that our new algorithm has {\em constant} regret unlike the previous approaches.
